Abstract
MOTIVATION: How can we identify causal genetic mechanisms governing bacterial traits? Initial efforts that entrust machine learning models with predicting phenotype from genotype yield high accuracy scores. However, attempts to extract meaningful interpretations from the predictive models are corrupted by falsely identified 'causal' features. Relying solely on pattern recognition and correlations is unreliable, particularly in bacterial genomics settings where high dimensionality and spurious associations are the norm. Though it is not yet clear whether this hurdle can be overcome, significant efforts are being made towards discovering potential high-risk bacterial genetic variants. In view of this, we set out open problems surrounding phenotype prediction from bacterial whole-genome datasets and the extension of those approaches to learning causal effects, and discuss challenges that impact the reliability of a machine's decision-making when faced with datasets of this nature.
RESULTS: We identify major sources of non-injectivity in the formulation of the genotype-to-phenotype mapping function (linkage disequilibrium, limited sampling, information loss in representations, unmeasured confounders and observational noise) and analyse their implications for machine learning applications. Using a collection of 4,140 Staphylococcus aureus isolates, we illustrate challenges surrounding the defined open problems.
AVAILABILITY AND IMPLEMENTATION: Raw sequencing data are available from the European Nucleotide Archive (ENA) under project accessions ERP001012, PRJEB3174, PRJEB2655, PRJEB2756, and PRJEB2944. Assemblies and annotations were generated with the Sanger bacterial pipeline (https://github.com/sanger-pathogens/vr-codebase) and unitigs extracted using DBGWAS (https://gitlab.com/leoisl/dbgwas).
| Original language | English |
|---|---|
| Article number | btaf206 |
| Number of pages | 9 |
| Journal | Bioinformatics |
| Volume | 41 |
| Issue number | 7 |
| Early online date | 23 Jun 2025 |
| DOIs | |
| Publication status | Published - Jul 2025 |
Bibliographical note
© The Author(s) 2025. Published by Oxford University Press.
Keywords
- Machine Learning
- Genome, Bacterial
- Phenotype
- Genomics/methods
- Genotype
- Staphylococcus aureus/genetics
- Bacteria/genetics