Abstract
BACKGROUND: Each year, thousands of clinical prediction models are developed to make predictions (e.g. estimated risk) to inform individual diagnosis and prognosis in healthcare. However, most are not reliable for use in clinical practice.
MAIN BODY: We discuss how the creation of a prediction model (e.g. using regression or machine learning methods) depends on the sample and size of data used to develop it: were a different sample of the same size drawn from the same overarching population, the developed model could be very different even when the same model development methods are used. In other words, for each model created, there exists a multiverse of other potential models for that sample size and, crucially, an individual's predicted value (e.g. estimated risk) may vary greatly across this multiverse. The more an individual's prediction varies across the multiverse, the greater the instability. We show how small development datasets lead to a wider multiverse of different models, often with vastly unstable individual predictions, and explain how this instability can be exposed by using bootstrapping and presenting instability plots. We recommend healthcare researchers seek to use large model development datasets to reduce instability concerns. This is especially important to ensure reliability across subgroups and improve model fairness in practice.
CONCLUSIONS: Instability is concerning because an individual's predicted value is used to guide their counselling, resource prioritisation, and clinical decision making. If different samples lead to different models with very different predictions for the same individual, then this should cast doubt on using a particular model for that individual. Therefore, visualising, quantifying and reporting the instability in individual-level predictions is essential when proposing a new model.
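As a companion to the abstract, the following is a minimal sketch (not the authors' code) of how bootstrapping can expose prediction instability: the same model specification is refitted on bootstrap resamples of a development dataset, and each individual's predicted risk from the original model is compared against their predicted risks across the bootstrap models. The dataset, model choice (logistic regression) and number of bootstrap samples are illustrative assumptions.

```python
# Minimal sketch of bootstrap-based prediction instability assessment.
# Assumes a synthetic binary-outcome dataset and a logistic regression
# development model; these are illustrative, not the authors' setup.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Hypothetical (small) development dataset to illustrate instability
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# "Original" model developed on the full development sample
original_model = LogisticRegression(max_iter=1000).fit(X, y)
original_pred = original_model.predict_proba(X)[:, 1]

# Refit the same model specification on B bootstrap samples and store
# each bootstrap model's predicted risks for the original individuals
B = 200
boot_preds = np.empty((B, len(y)))
for b in range(B):
    idx = rng.integers(0, len(y), size=len(y))  # resample with replacement
    boot_model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    boot_preds[b] = boot_model.predict_proba(X)[:, 1]

# Quantify instability: spread of bootstrap predictions per individual
lower, upper = np.percentile(boot_preds, [2.5, 97.5], axis=0)
print("Mean width of 95% instability interval:", np.mean(upper - lower))

# Instability plot: original predicted risk vs bootstrap predicted risks
plt.scatter(np.tile(original_pred, B), boot_preds.ravel(), s=2, alpha=0.1)
plt.plot([0, 1], [0, 1], color="red")  # line of agreement
plt.xlabel("Predicted risk from original model")
plt.ylabel("Predicted risk from bootstrap models")
plt.title("Prediction instability plot")
plt.show()
```

Wide scatter around the line of agreement indicates individuals whose predictions vary greatly across the multiverse of bootstrap models; with larger development samples the scatter tightens.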
| Original language | English |
| --- | --- |
| Article number | 502 |
| Journal | BMC Medicine |
| Volume | 21 |
| Issue number | 1 |
| DOIs | |
| Publication status | Published - 18 Dec 2023 |
Bibliographical note
Funding: This paper presents independent research supported (for RDR, PD, GSC) by an EPSRC grant for ‘Artificial intelligence innovation to accelerate health research’ (number: EP/Y018516/1); (for RR, LA and GSC) by an NIHR-MRC Better Methods Better Research grant (MR/V038168/1); and (for RDR and LA) by the NIHR Birmingham Biomedical Research Centre at the University Hospitals Birmingham NHS Foundation Trust and the University of Birmingham. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care. GSC is supported by Cancer Research UK (programme grant: C49297/A27294). PD is supported by Cancer Research UK (project grant: PRCPJT-Nov21\100021). RDR, GPM and AP are also supported by funding from an MRC-NIHR Methodology Research Programme grant (number: MR/T025085/1).
© 2023. The Author(s).
Keywords
- Humans
- Prognosis
- Models, Statistical
- Reproducibility of Results
Projects
- 1 Active
Sample Size guidance for developing and validating reliable and fair AI PREDICTion models in healthcare (SS-PREDICT)
Cazier, J.-B. (Co-Investigator), Riley, R. (Principal Investigator), Snell, K. (Co-Investigator), Archer, L. (Co-Investigator), Nirantharakumar, K. (Co-Investigator), Ensor, J. (Co-Investigator), Denniston, A. (Researcher), Adderley, N. (Researcher) & Liu, X. (Researcher)
2/10/23 → 1/04/25
Project: Research Councils