Whole-genome phenotype prediction with machine learning: open problems in bacterial genomics

Tamsin James*, Ben Williamson, Peter Tino, Nicole Wheeler

*Corresponding author for this work

Research output: Contribution to journal > Article > peer-review

Abstract

MOTIVATION: How can we identify causal genetic mechanisms governing bacterial traits? Initial efforts entrusting machine learning models to handle the task of predicting phenotype from genotype yield high accuracy scores. However, attempts to extract meaningful interpretations from the predictive models are found to be corrupted by falsely identified 'causal' features. Relying solely on pattern recognition and correlations is unreliable, particularly so in bacterial genomics settings where high dimensionality and spurious associations are the norm. Though it is not yet clear whether we can overcome this hurdle, significant efforts are being made towards discovering potential high-risk bacterial genetic variants. In view of this, we set up open problems surrounding phenotype prediction from bacterial whole-genome datasets and extending those approaches to learning causal effects, and discuss challenges that impact the reliability of a machine's decision-making when faced with datasets of this nature.

RESULTS: We identify major sources of non-injectivity in the formulation of the genotype-to-phenotype mapping function (linkage disequilibrium, limited sampling, information loss in representations, unmeasured confounders and observational noise) and analyse their implications for machine learning applications. Using a collection of 4,140 Staphylococcus aureus isolates, we illustrate challenges surrounding the defined open problems.

AVAILABILITY AND IMPLEMENTATION: Raw sequencing data are available from the European Nucleotide Archive (ENA) under project accessions ERP001012, PRJEB3174, PRJEB2655, PRJEB2756, and PRJEB2944. Assemblies and annotations were generated with the Sanger bacterial pipeline (https://github.com/sanger-pathogens/vr-codebase) and unitigs extracted using DBGWAS (https://gitlab.com/leoisl/dbgwas).

Original language: English
Article number: btaf206
Number of pages: 9
Journal: Bioinformatics
Volume: 41
Issue number: 7
Early online date: 23 Jun 2025
Publication status: Published - Jul 2025

Bibliographical note

© The Author(s) 2025. Published by Oxford University Press.

Keywords

  • Machine Learning
  • Genome, Bacterial
  • Phenotype
  • Genomics/methods
  • Genotype
  • Staphylococcus aureus/genetics
  • Bacteria/genetics
