Rapid improvement in ability of AI to reason using clinical guidelines

  • S. Khattak
  • J. T. Townend
  • N. K. Khan
  • M. T. Thomas

Research output: Contribution to journal › Abstract › peer-review

Abstract

Background: Large language models (LLMs) have seen rapid adoption, with over 100 million users engaging with ChatGPT within two months of its 2022 launch. Despite their potential, LLMs have struggled with clinical reasoning tasks. Recent advancements in 2024 introduced more powerful models like OpenAI’s o1-preview, optimized for reasoning through reinforcement learning. This study examines whether the latest LLMs have significantly improved their ability to apply clinical reasoning using up-to-date clinical guidelines.

Methods: We developed 201 clinical scenarios based on 43 guidelines from the National Institute for Health and Care Excellence (NICE), covering conditions such as coronary artery disease, heart failure, chronic kidney disease, hypertension, diabetes, and more. Seventeen LLMs from OpenAI, Anthropic, and Google were evaluated. Prompt engineering techniques were employed, including chain-of-thought (CoT), multi-shot learning, and retrieval-augmented generation (RAG). LLMs were instructed to recommend medications for each clinical scenario strictly according to NICE guidelines, with outputs assessed using an automated system. The LLMs generated a total of 118,992 medication recommendations.
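The abstract does not describe how the automated assessment works internally. As a minimal sketch only, assuming each scenario has a reference set of guideline-recommended medications and that scoring is done by set overlap on drug names, a micro-averaged F1 score could be computed as below. The function names, the name normalisation, and the example drug lists are all hypothetical illustrations, not the authors' system or data.

```python
from typing import Iterable, Set, Tuple


def normalise(names: Iterable[str]) -> Set[str]:
    """Lower-case and trim drug names so that 'Ramipril ' matches 'ramipril'."""
    return {name.strip().lower() for name in names}


def micro_f1(pairs: Iterable[Tuple[Iterable[str], Iterable[str]]]) -> float:
    """Micro-averaged F1 over (recommended, expected) medication lists.

    Assumption: a recommendation counts as correct only if the same drug
    name appears in the reference list for that scenario.
    """
    tp = fp = fn = 0
    for recommended, expected in pairs:
        rec, exp = normalise(recommended), normalise(expected)
        tp += len(rec & exp)   # recommended and in the reference list
        fp += len(rec - exp)   # recommended but not in the reference list
        fn += len(exp - rec)   # in the reference list but not recommended
    denominator = 2 * tp + fp + fn
    # Assumption: with nothing expected and nothing recommended, score 1.0.
    return 2 * tp / denominator if denominator else 1.0


# Made-up scenarios for illustration only (not taken from the study).
example_pairs = [
    (["Ramipril", "bisoprolol"], ["ramipril", "bisoprolol", "dapagliflozin"]),
    (["amlodipine", "aspirin"], ["amlodipine"]),
]
print(f"Micro F1: {micro_f1(example_pairs):.3f}")
```

In this sketch, a missed medication lowers recall and an unsupported medication lowers precision, so the F1 score penalises both omissions and inappropriate additions.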

Results: Models released in 2024 demonstrated significant improvements over their predecessors. OpenAI’s o1-preview achieved the highest accuracy with an F1 score of 73.0%, a step change compared with the highest-performing models from 2023 (GPT-4 Turbo: F1 score 45.7%, P < 0.001) and 2022 (GPT-3.5 Turbo: F1 score 32.1%, P < 0.001). Providing the most recent guideline recommendations in real time with RAG enhanced the performance of all the most recent LLMs. However, even top-performing LLMs showed decreased accuracy in multimorbidity scenarios and occasionally suggested dangerous treatments.

Conclusion: There was a step change in the ability of AI to reason using clinical guidelines in 2024. Nonetheless, challenges persist, particularly in managing complex cases with multiple comorbidities, and guard rails are still needed to prevent inappropriate recommendations. Continued advancements and implementation of safety measures are essential for reliable clinical application.

[Figure presented]

[Figure presented]
Original language: English
Article number: ztaf143.056
Number of pages: 3
Journal: European Heart Journal - Digital Health
Volume: 7
Issue number: Supplement_1
DOIs
Publication status: Published - 12 Jan 2026
Event: European Society of Cardiology - Digital & AI Summit 2025 - Berlin, Germany
Duration: 21 Nov 2025 to 22 Nov 2025
https://esc365.escardio.org/ESC-Digital-AI-Summit

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 3 - Good Health and Well-being
