Abstract
Background: Throughout the Covid-19 pandemic, artificial intelligence (AI) models were developed in response to significant resource constraints affecting healthcare systems. Previous systematic reviews demonstrate that healthcare datasets often have significant limitations, contributing to bias in any AI health technologies they are used to develop. This systematic review aimed to characterise the composition and reporting of datasets created throughout the Covid-19 pandemic, and to highlight key deficiencies which could affect downstream AI models.
Methods: A systematic search of MEDLINE identified articles describing datasets used for AI health technology development. Studies were screened for eligibility, and datasets collated for analysis. Google Dataset Search was used to identify additional datasets. After deduplication and exclusion of datasets not related to Covid-19 or those not containing data relating to individual humans, dataset documentation was assessed for the completeness of metadata reporting, their composition, the means of data access and any restrictions, ethical considerations, and other factors.
Findings: 192 datasets were analysed. Metadata were often incomplete or absent. Only 48% of datasets’ documentation described the country where data originated, 43% reported the age of individuals included, and under 25% reported sex, gender, race, ethnicity or any other attributes. Most datasets provided no information on data labelling, ethical review, or consent for data sharing. Many datasets reproduced data from other datasets, sometimes without linking to the original source. We found multiple cases where paediatric chest X-ray images from prior to the Covid-19 pandemic were reproduced in datasets without this being acknowledged.
Interpretation: This review highlights substantial deficiencies in the documentation of many Covid-19 datasets. It is imperative to balance data availability with data quality in future health emergencies, or else we risk developing biased AI health technologies which do more harm than good.
Funding: This review was funded by The NHS AI Lab and The Health Foundation, and supported by the National Institute for Health and Care Research (AI_HI200014).
| Original language | English |
|---|---|
| Pages (from-to) | e827-e847 |
| Number of pages | 21 |
| Journal | The Lancet Digital Health |
| Volume | 6 |
| Issue number | 11 |
| Early online date | 23 Oct 2024 |
| DOIs | |
| Publication status | Published - Nov 2024 |