Abstract
Named Entity Recognition (NER) in the tender and procurement domain is critical for tasks such as contract monitoring, supplier analysis, and compliance tracking. However, unlike general-purpose NER, no open-source datasets exist for Tender NER, largely due to data sensitivity and confidentiality restrictions. This scarcity limits the development of automated entity extraction models.
To address this gap, we propose struct2unstruct, a data preparation pipeline that generates and annotates tender-specific datasets using large language models (LLMs). Starting from structured procurement data published in English by the Singapore government (2015–2021), we employ Llama-3 to generate synthetic tender narratives in multiple writing styles, ensuring each contains at least one tender-related entity. Post-processing steps correct inconsistencies in dates, symbols, and entity formats. Entities are then annotated using a BIO tagging scheme through deterministic alignment with structured fields, followed by expert validation to ensure accuracy.
This study focuses on data preparation and evaluation, not model training. The resulting dataset provides a scalable resource for future Tender NER research in low-resource environments. By releasing both the dataset and pipeline as open-source resources, we establish a foundation for advancing domain-adapted information extraction and automated tender entity recognition.
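The deterministic alignment step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that alignment means exact, case-insensitive token matching between a structured field value and the generated narrative; the abstract does not specify the matching rule, so this is an assumption rather than the authors' implementation. The function names (`tokenize`, `bio_tag`), the label names (`AGENCY`, `SUPPLIER`, `AMOUNT`), and the sample record are all hypothetical and invented for illustration.

```python
import re
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    # Split into word tokens and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def bio_tag(narrative: str, fields: Dict[str, str]) -> List[Tuple[str, str]]:
    """Tag narrative tokens with BIO labels by matching structured field values.

    Assumption: a field value counts as an entity mention only when its
    tokens appear contiguously (case-insensitively) in the narrative.
    """
    tokens = tokenize(narrative)
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)

    # Match longer values first so they claim their tokens before shorter ones.
    ordered = sorted(fields.items(), key=lambda kv: -len(tokenize(kv[1])))
    for label, value in ordered:
        target = [t.lower() for t in tokenize(value)]
        n = len(target)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            span_free = all(t == "O" for t in tags[i:i + n])
            if lowered[i:i + n] == target and span_free:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
    return list(zip(tokens, tags))


# Hypothetical structured record and synthetic narrative (made-up values).
record = {
    "AGENCY": "Ministry of Education",
    "SUPPLIER": "Acme Pte Ltd",
    "AMOUNT": "12500",
}
narrative = "The Ministry of Education awarded the tender to Acme Pte Ltd for S$12500."

for token, tag in bio_tag(narrative, record):
    print(f"{token}\t{tag}")
```

Because the labels come directly from the structured record, any narrative in which the LLM paraphrased or reformatted a field value would be left untagged under this exact-match rule, which is one reason a post-processing and expert validation stage is useful.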
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the fourth international workshop on the role of resources in the age of large language models (RESOURCEFUL-2026) |
| Number of pages | 8 |
| Publication status | Accepted/In press - 26 Mar 2026 |
| Event | Fourth international workshop on the role of resources in the age of large language models - Palma, Spain. Duration: 11 May 2026 → 11 May 2026 |
Publication series
| Name | NEALT Proceedings Series |
|---|---|
| ISSN (Print) | 1736-8197 |
| ISSN (Electronic) | 1736-6305 |
Workshop
| Workshop | Fourth international workshop on the role of resources in the age of large language models |
|---|---|
| Abbreviated title | RESOURCEFUL 2026 |
| Country/Territory | Spain |
| City | Palma |
| Period | 11/05/26 → 11/05/26 |
Bibliographical note
Not yet published as of 27/04/2026.
Keywords
- Named Entities Recognition
- data augmentation
- Large Language Model (LLM)
- Data Preparation
Fingerprint
Dive into the research topics of 'Struct2Unstruct: Creating Tender NER Datasets from Structured Procurement Records using Large Language Models'. Together they form a unique fingerprint.