
Struct2Unstruct: Creating Tender NER Datasets from Structured Procurement Records using Large Language Models

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Named Entity Recognition (NER) in the tender and procurement domain is critical for tasks such as contract monitoring, supplier analysis, and compliance tracking. However, unlike general-purpose NER, no open-source datasets exist for Tender NER, largely due to data sensitivity and confidentiality restrictions. This scarcity limits the development of automated entity extraction models.
To address this gap, we propose struct2unstruct, a data preparation pipeline that generates and annotates tender-specific datasets using large language models (LLMs). Starting from English-language structured procurement data published by the Singapore government (2015–2021), we employ Llama-3 to generate synthetic tender narratives in multiple writing styles, ensuring that each narrative contains at least one tender-related entity. Post-processing steps correct inconsistencies in dates, symbols, and entity formats. Entities are then annotated using a BIO tagging scheme through deterministic alignment with the structured fields, followed by expert validation to ensure accuracy.
This study focuses on data preparation and evaluation, not model training. The resulting dataset provides a scalable resource for future Tender NER research in low-resource environments. By releasing both the dataset and pipeline as open-source resources, we establish a foundation for advancing domain-adapted information extraction and automated tender entity recognition.
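The abstract does not spell out how the deterministic alignment is implemented. As a rough illustration only, the sketch below assumes that alignment means exact string matching of structured field values against the generated narrative, with whitespace tokenization. The function name `bio_tag`, the labels (`AGENCY`, `TENDER_NO`, `SUPPLIER`), and the example record are all hypothetical and are not taken from the paper or its dataset.

```python
import re


def bio_tag(text, fields):
    """Assign BIO tags to whitespace tokens by exact-matching field values.

    `fields` maps a (hypothetical) entity label to the string value taken
    from the structured record, e.g. {"SUPPLIER": "ACME PTE LTD"}.
    """
    # Whitespace tokens together with their character offsets.
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)

    # Process longer values first so they take priority when spans overlap.
    for label, value in sorted(fields.items(), key=lambda kv: -len(kv[1])):
        if not value:
            continue  # an empty field cannot be aligned
        for match in re.finditer(re.escape(value), text):
            # Indices of tokens that overlap the matched character span.
            inside = [
                i for i, (_, start, end) in enumerate(tokens)
                if start < match.end() and end > match.start()
            ]
            # Skip if nothing matched or a token is already tagged.
            if not inside or any(tags[i] != "O" for i in inside):
                continue
            tags[inside[0]] = f"B-{label}"
            for i in inside[1:]:
                tags[i] = f"I-{label}"

    return [(token, tag) for (token, _, _), tag in zip(tokens, tags)]


# Hypothetical narrative and record, for illustration only.
example_text = "The Ministry of Education awarded tender MOE000ETT15000012 to ACME PTE LTD."
example_fields = {
    "AGENCY": "Ministry of Education",
    "TENDER_NO": "MOE000ETT15000012",
    "SUPPLIER": "ACME PTE LTD",
}

for token, tag in bio_tag(example_text, example_fields):
    print(f"{token}\t{tag}")
```

Exact matching only works if the generated narrative reproduces each field value verbatim, which is one plausible reason the pipeline normalizes dates, symbols, and entity formats before annotation. Note that in this sketch trailing punctuation stays attached to the token (`LTD.`), so a real pipeline would likely use a finer tokenizer.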
Original language: English
Title of host publication: Proceedings of the fourth international workshop on the role of resources in the age of large language models (RESOURCEFUL-2026)
Number of pages: 8
Publication status: Accepted/In press - 26 Mar 2026
Event: Fourth international workshop on the role of resources in the age of large language models - Palma, Spain
Duration: 11 May 2026 → 11 May 2026

Publication series

Name: NEALT Proceedings Series
ISSN (Print): 1736-8197
ISSN (Electronic): 1736-6305

Workshop

Workshop: Fourth international workshop on the role of resources in the age of large language models
Abbreviated title: RESOURCEFUL 2026
Country/Territory: Spain
City: Palma
Period: 11/05/26 → 11/05/26

Bibliographical note

Not yet published as of 27/04/2026.

Keywords

  • Named Entities Recognition
  • data augmentation
  • Large Language Model (LLM)
  • Data Preparation
