Abstract
Antimicrobial resistance (AMR) poses a growing threat to human health. Increasingly, genome sequencing is being applied for the surveillance of bacterial pathogens, producing a wealth of data to train machine learning (ML) applications to predict AMR and identify resistance determinants. However, bacterial populations are highly structured, and sampling is biased towards human disease isolates, violating ML assumptions of independence between samples. This is rarely considered in applications of ML to AMR. Here, we demonstrate the confounding effects of sample structure by analyzing over 24,000 whole genome sequences and AMR phenotypes from five diverse pathogens, using pathological training data where resistance is confounded with phylogeny. We show that the resulting ML models perform poorly and that increasing the training sample size fails to rescue performance. A comprehensive analysis of 6,740 models identifies species- and drug-specific effects on model accuracy. These findings highlight the limitations of current ML approaches in the face of realistic sampling biases and underscore the need for population-structure-aware methods and more diverse datasets to improve AMR prediction and surveillance.
| Original language | English |
|---|---|
| Article number | e3003539 |
| Number of pages | 12 |
| Journal | PLoS Biology |
| Volume | 23 |
| Issue number | 12 |
| DOIs | |
| Publication status | Published - 16 Dec 2025 |