Revealing transparency gaps in publicly available Covid-19 datasets used for medical artificial intelligence development: a systematic review

Joseph Alderman, Maria Charalambides, Gagandeep Sachdeva, Elinor Laws, Joanne Palmer, Elsa Lee, Vaishnavi Menon, Qasim Malik, Sonam Vadera, Melanie Calvert, Marzyeh Ghassemi, Melissa D McCradden, Johan Ordish, Bilal Mateen, Charlotte Summers, Jacqui Gath, Rubeta N. Matin, Alastair K Denniston, Xiaoxuan Liu*

*Corresponding author for this work

Research output: Contribution to journal › Review article › peer-review


Abstract

Background: Throughout the Covid-19 pandemic, artificial intelligence (AI) models were developed in response to significant resource constraints affecting healthcare systems. Previous systematic reviews demonstrate that healthcare datasets often have significant limitations, contributing to bias in any AI health technologies they are used to develop. This systematic review aimed to characterise the composition and reporting of datasets created throughout the Covid-19 pandemic, and to highlight key deficiencies which could affect downstream AI models.

Methods: A systematic search of MEDLINE identified articles describing datasets used for AI health technology development. Studies were screened for eligibility, and datasets collated for analysis. Google Dataset Search was used to identify additional datasets. After deduplication and exclusion of datasets not related to Covid-19 or those not containing data relating to individual humans, dataset documentation was assessed for the completeness of metadata reporting, their composition, the means of data access and any restrictions, ethical considerations, and other factors.

Findings: 192 datasets were analysed. Metadata were often incomplete or absent. Only 48% of datasets’ documentation described the country where data originated, 43% reported the age of individuals included, and under 25% reported sex, gender, race, ethnicity or any other attributes. Most datasets provided no information on data labelling, ethical review, or consent for data sharing. Many datasets reproduced data from other datasets, sometimes without linking to the original source. We found multiple cases where paediatric chest X-ray images from prior to the Covid-19 pandemic were reproduced in datasets without this being acknowledged.

Interpretation: This review highlights substantial deficiencies in the documentation of many Covid-19 datasets. It is imperative to balance data availability with data quality in future health emergencies, or else we risk developing biased AI health technologies which do more harm than good.

Funding: This review was funded by The NHS AI Lab and The Health Foundation, and supported by the National Institute for Health and Care Research (AI_HI200014).
Original language: English
Pages (from-to): e827-e847
Number of pages: 21
Journal: The Lancet Digital Health
Volume: 6
Issue number: 11
Early online date: 23 Oct 2024
Publication status: Published - Nov 2024

