Biased sampling driven by bacterial population structure confounds machine learning prediction of antimicrobial resistance

  • Yanying Yu
  • Nicole E. Wheeler
  • Lars Barquist*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Antimicrobial resistance (AMR) poses a growing threat to human health. Increasingly, genome sequencing is being applied for the surveillance of bacterial pathogens, producing a wealth of data to train machine learning (ML) applications to predict AMR and identify resistance determinants. However, bacterial populations are highly structured, and sampling is biased towards human disease isolates, violating ML assumptions of independence between samples. This is rarely considered in applications of ML to AMR. Here, we demonstrate the confounding effects of sample structure by analyzing over 24,000 whole genome sequences and AMR phenotypes from five diverse pathogens, using pathological training data where resistance is confounded with phylogeny. We show the resulting ML models perform poorly and that increasing the training sample size fails to rescue performance. A comprehensive analysis of 6,740 models identifies species- and drug-specific effects on model accuracy. These findings highlight the limitations of current ML approaches in the face of realistic sampling biases and underscore the need for population structure-aware methods and more diverse datasets to improve AMR prediction and surveillance.
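One general way to keep related isolates from appearing on both sides of a train/test split is to hold out whole lineages rather than random samples. The sketch below is illustrative only and is not the authors' procedure: the function name `split_by_clade`, the clade labels, and the toy data are all hypothetical, and it assumes each isolate has already been assigned to a lineage by some other means.

```python
import random
from collections import defaultdict


def split_by_clade(clade_labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) so that no clade appears on both sides.

    clade_labels: one (hypothetical) lineage/clade label per isolate.
    """
    # Group isolate indices by their clade label.
    members = defaultdict(list)
    for idx, clade in enumerate(clade_labels):
        members[clade].append(idx)

    # Shuffle clades (not isolates) in a reproducible order.
    clades = sorted(members)
    random.Random(seed).shuffle(clades)

    target = test_fraction * len(clade_labels)
    train_idx, test_idx = [], []
    for clade in clades:
        # Assign whole clades to the test set until it reaches the target size.
        if len(test_idx) < target:
            test_idx.extend(members[clade])
        else:
            train_idx.extend(members[clade])
    return sorted(train_idx), sorted(test_idx)


# Toy example: 8 isolates from 4 made-up clades.
clades = ["A", "A", "A", "B", "B", "C", "C", "D"]
train_idx, test_idx = split_by_clade(clades, test_fraction=0.25, seed=1)
train_clades = {clades[i] for i in train_idx}
test_clades = {clades[i] for i in test_idx}
print(train_clades & test_clades)  # set(): no clade is shared between the splits
```

Because whole clades are assigned at once, the test set can overshoot the requested fraction when clades are large. That trade-off is the price of keeping closely related genomes out of both splits.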
Original language: English
Article number: e3003539
Number of pages: 12
Journal: PLoS Biology
Volume: 23
Issue number: 12
Publication status: Published - 16 Dec 2025
