TY - JOUR
T1 - Benchmarking transformer-based models for medical record de-identification in a single center multi-specialty evaluation
AU - Kuo, Rachel
AU - Soltan, Andrew A.S.
AU - O'Hanlon, Ciaran
AU - Hasanic, Alan
AU - Clifton, David A.
AU - Collins, Gary
AU - Furniss, Dominic
AU - Eyre, David W.
PY - 2025/12/19
Y1 - 2025/12/19
N2 - Protecting patient confidentiality is central to enabling research using electronic health records. Automated text de-identification offers a scalable alternative to manual redaction. However, different approaches vary in accuracy and adaptability. We evaluated four transformer-based, task-specific models and five large language models on 3,650 clinical records spanning general and specialty datasets from a UK hospital group. Records were dual-annotated by clinicians, allowing precise comparison of performance. The Microsoft Azure de-identification service achieved the highest F1 score, approaching clinician performance, while fine-tuned AnonCAT and GPT-4-0125 with few-shot prompting also performed strongly. Smaller LLMs frequently over-redacted or produced hallucinatory content, limiting interpretability. Task-specific models demonstrated greater stability across datasets, while low-level adaptation improved performance in both model classes. These findings highlight that automated de-identification systems can provide effective support for large-scale sharing of clinical records, but success depends on careful model choice, adaptation strategies, and safeguards to ensure robust data utility and privacy.
KW - Artificial intelligence
KW - Health informatics
UR - https://www.scopus.com/pages/publications/105023189679
U2 - 10.1016/j.isci.2025.113732
DO - 10.1016/j.isci.2025.113732
M3 - Article
AN - SCOPUS:105023189679
SN - 2589-0042
VL - 28
JO - iScience
JF - iScience
IS - 12
M1 - 113732
ER -