Missing values in mass spectrometry based metabolomics: An undervalued step in the data processing pipeline

O. Hrydziuszko; M.R. Viant

doi:10.1007/s11306-011-0366-4

Biosciences

Research output: Contribution to journal › Article › peer-review

117 Citations (Scopus)

Abstract

Missing values in mass spectrometry metabolomic datasets occur widely and can originate from a number of sources, both technical and biological. Currently, little is known about these data, i.e. about their distributions across datasets, the need (or not) to consider them in the data processing pipeline, and most importantly, the optimal way of assigning them values prior to univariate or multivariate data analysis. Here, we address all of these issues using direct infusion Fourier transform ion cyclotron resonance mass spectrometry data. We have shown that missing data are widespread, accounting for ca. 20% of data and affecting up to 80% of all variables, and that they do not occur randomly but rather as a function of signal intensity and mass-to-charge ratio. We have demonstrated that missing data estimation algorithms have a major effect on the outcome of data analysis when comparing the differences between biological sample groups, including by t-test, ANOVA and principal component analysis. Furthermore, results varied significantly across the eight algorithms that we assessed for their ability to impute known, but labelled as missing, entries. Based on all of our findings we identified the k-nearest neighbour imputation method (KNN) as the optimal missing value estimation approach for our direct infusion mass spectrometry datasets. However, we believe the wider significance of this study is that it highlights the importance of missing metabolite levels in the data processing pipeline and offers an approach to identify optimal ways of treating missing data in metabolomics experiments.
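For readers unfamiliar with the k-nearest neighbour idea named in the abstract, the following is a minimal, generic sketch in Python. It is not the authors' implementation: the function name `knn_impute`, the choice of samples (rows) as neighbours, the root-mean-square distance over shared peaks, the default `k`, and the toy intensity matrix are all assumptions made for illustration only.

```python
import math


def knn_impute(matrix, k=2):
    """Fill None entries with the mean of the k nearest rows that observe that column.

    Illustrative sketch only: rows are samples, columns are peaks,
    and None marks a missing value (all of these are assumptions).
    """
    n_cols = len(matrix[0])

    def distance(a, b):
        # Root-mean-square difference over the columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in matrix]  # copy, so the input is left unchanged
    for i, row in enumerate(matrix):
        for j in range(n_cols):
            if row[j] is not None:
                continue
            # Candidate neighbours: other rows with an observed value in column j.
            candidates = [
                (distance(row, other), other[j])
                for m, other in enumerate(matrix)
                if m != i and other[j] is not None
            ]
            candidates = [c for c in candidates if math.isfinite(c[0])]
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:
                imputed[i][j] = sum(value for _, value in nearest) / len(nearest)
    return imputed


# Hypothetical toy data: four samples, three peaks, one missing intensity.
peaks = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [0.9, 1.9, 2.9],
    [5.0, 6.0, 7.0],
]
filled = knn_impute(peaks, k=2)
print(filled[1][2])  # close to 2.95, the mean of the two nearest samples' values
```

In this toy case the distant fourth sample is ignored, so the estimate follows the two similar samples rather than a global mean. Real pipelines may instead search for neighbours among peaks, scale intensities first, or choose `k` by cross-validation; none of those choices is specified in the abstract.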

Original language	English
Pages (from-to)	161-174
Number of pages	14
Journal	Metabolomics
Volume	8
DOIs	https://doi.org/10.1007/s11306-011-0366-4
Publication status	Published - 1 Jun 2012

Cite this

@article{c2de4dfe322240879608fcd3bf11a2f1,
title = "Missing values in mass spectrometry based metabolomics: An undervalued step in the data processing pipeline",
abstract = "Missing values in mass spectrometry metabolomic datasets occur widely and can originate from a number of sources, both technical and biological. Currently, little is known about these data, i.e. about their distributions across datasets, the need (or not) to consider them in the data processing pipeline, and most importantly, the optimal way of assigning them values prior to univariate or multivariate data analysis. Here, we address all of these issues using direct infusion Fourier transform ion cyclotron resonance mass spectrometry data. We have shown that missing data are widespread, accounting for ca. 20\% of data and affecting up to 80\% of all variables, and that they do not occur randomly but rather as a function of signal intensity and mass-to-charge ratio. We have demonstrated that missing data estimation algorithms have a major effect on the outcome of data analysis when comparing the differences between biological sample groups, including by t-test, ANOVA and principal component analysis. Furthermore, results varied significantly across the eight algorithms that we assessed for their ability to impute known, but labelled as missing, entries. Based on all of our findings we identified the k-nearest neighbour imputation method (KNN) as the optimal missing value estimation approach for our direct infusion mass spectrometry datasets. However, we believe the wider significance of this study is that it highlights the importance of missing metabolite levels in the data processing pipeline and offers an approach to identify optimal ways of treating missing data in metabolomics experiments.",
author = "O. Hrydziuszko and M.R. Viant",
year = "2012",
month = jun,
day = "1",
doi = "10.1007/s11306-011-0366-4",
language = "English",
volume = "8",
pages = "161--174",
journal = "Metabolomics",
issn = "1573-3882",
publisher = "Springer",
}

TY  - JOUR
T1  - Missing values in mass spectrometry based metabolomics
T2  - An undervalued step in the data processing pipeline
AU  - Hrydziuszko, O.
AU  - Viant, M.R.
PY  - 2012/6/1
Y1  - 2012/6/1
AB  - Missing values in mass spectrometry metabolomic datasets occur widely and can originate from a number of sources, both technical and biological. Currently, little is known about these data, i.e. about their distributions across datasets, the need (or not) to consider them in the data processing pipeline, and most importantly, the optimal way of assigning them values prior to univariate or multivariate data analysis. Here, we address all of these issues using direct infusion Fourier transform ion cyclotron resonance mass spectrometry data. We have shown that missing data are widespread, accounting for ca. 20% of data and affecting up to 80% of all variables, and that they do not occur randomly but rather as a function of signal intensity and mass-to-charge ratio. We have demonstrated that missing data estimation algorithms have a major effect on the outcome of data analysis when comparing the differences between biological sample groups, including by t-test, ANOVA and principal component analysis. Furthermore, results varied significantly across the eight algorithms that we assessed for their ability to impute known, but labelled as missing, entries. Based on all of our findings we identified the k-nearest neighbour imputation method (KNN) as the optimal missing value estimation approach for our direct infusion mass spectrometry datasets. However, we believe the wider significance of this study is that it highlights the importance of missing metabolite levels in the data processing pipeline and offers an approach to identify optimal ways of treating missing data in metabolomics experiments.
UR  - http://www.scopus.com/inward/record.url?eid=2-s2.0-84860380628&partnerID=8YFLogxK
U2  - 10.1007/s11306-011-0366-4
DO  - 10.1007/s11306-011-0366-4
M3  - Article
AN  - SCOPUS:84860380628
SN  - 1573-3882
VL  - 8
SP  - 161
EP  - 174
JO  - Metabolomics
JF  - Metabolomics
ER  - 
