Abstract
Background: Large language models (LLMs) have seen rapid adoption, with over 100 million users engaging with ChatGPT within two months of its 2022 launch. Despite their potential, LLMs have struggled with clinical reasoning tasks. Recent advancements in 2024 introduced more powerful models like OpenAI’s o1-preview, optimized for reasoning through reinforcement learning. This study examines whether the latest LLMs have significantly improved their ability to apply clinical reasoning using up-to-date clinical guidelines.
Methods: We developed 201 clinical scenarios based on 43 guidelines from the National Institute for Health and Care Excellence (NICE), covering conditions including coronary artery disease, heart failure, chronic kidney disease, hypertension, and diabetes. Seventeen LLMs from OpenAI, Anthropic, and Google were evaluated. Prompt engineering techniques were employed, including chain-of-thought (CoT), multi-shot learning, and retrieval-augmented generation (RAG). LLMs were instructed to recommend medications for each clinical scenario strictly according to NICE guidelines, with outputs assessed using an automated system. LLMs generated a total of 118,992 medication recommendations.
Results: Models released in 2024 demonstrated significant improvements over their predecessors. OpenAI’s o1-preview achieved the highest accuracy with an F1 score of 73.0%, a step change compared with the highest-performing models from 2023 (GPT-4 Turbo: F1 score 45.7%, P < 0.001) and 2022 (GPT-3.5 Turbo: F1 score 32.1%, P < 0.001). Providing the most recent guideline recommendations in real time with RAG enhanced the performance of all the most recent LLMs. However, even top-performing LLMs showed decreased accuracy in multi-morbidity and occasionally suggested dangerous treatments.
Conclusion: There has been a step change in the ability of AI to reason using clinical guidelines in 2024. Nonetheless, challenges persist, particularly in managing complex cases with multiple comorbidities, and guard rails are still needed to prevent inappropriate recommendations. Continued advancements and implementation of safety measures are essential for reliable clinical application.
[Figure presented]
[Figure presented]
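The abstract reports accuracy as an F1 score but does not describe how the automated system computed it. As a rough illustration only, the sketch below assumes that each scenario is scored by comparing the set of medications a model recommends against the set the guideline supports, matching on normalized drug names. The function names, the matching rule, and the example drugs are all hypothetical and are not taken from the study.

```python
from typing import Iterable, List, Tuple


def _normalize(drugs: Iterable[str]) -> set:
    """Lower-case and trim drug names so that 'Ramipril ' matches 'ramipril'."""
    return {d.strip().lower() for d in drugs}


def scenario_f1(recommended: Iterable[str], guideline: Iterable[str]) -> float:
    """F1 for one scenario (assumed scoring rule: exact match on drug name)."""
    rec, ref = _normalize(recommended), _normalize(guideline)
    tp = len(rec & ref)  # recommended and supported by the guideline
    precision = tp / len(rec) if rec else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_f1(pairs: List[Tuple[Iterable[str], Iterable[str]]]) -> float:
    """Pool true/false positives and false negatives across scenarios."""
    tp = fp = fn = 0
    for recommended, guideline in pairs:
        rec, ref = _normalize(recommended), _normalize(guideline)
        tp += len(rec & ref)
        fp += len(rec - ref)  # recommended but not in the guideline
        fn += len(ref - rec)  # in the guideline but not recommended
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical scenarios: (model output, guideline-supported medications)
example_pairs = [
    (["ramipril", "atorvastatin", "ibuprofen"],
     ["Ramipril", "Atorvastatin", "Bisoprolol"]),
    (["metformin"], ["metformin"]),
]
first_f1 = scenario_f1(*example_pairs[0])
pooled_f1 = micro_f1(example_pairs)
```

Whether scores are pooled across scenarios, as in `micro_f1`, or averaged per scenario changes the headline number, so a real evaluation would need to state which convention it uses.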
| Original language | English |
|---|---|
| Article number | ztaf143.056 |
| Number of pages | 3 |
| Journal | European Heart Journal - Digital Health |
| Volume | 7 |
| Issue number | Supplement_1 |
| DOIs | |
| Publication status | Published - 12 Jan 2026 |
| Event | European Society of Cardiology - Digital & AI Summit 2025, Berlin, Germany. Duration: 21 Nov 2025 → 22 Nov 2025. https://esc365.escardio.org/ESC-Digital-AI-Summit |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 3 Good Health and Well-being
Fingerprint
Dive into the research topics of 'Rapid improvement in ability of AI to reason using clinical guidelines'. Together they form a unique fingerprint.