Unsupervised methods in LC-MS data treatment: application for potential chemotaxonomic markers search

Polina Turova; Iain Styles; Vladimir Timashev; Konstantin Kravets; Alexander Grechnikov; Dmitry Lyskov; Tahir Samigullin; Ilya Podolskiy; Oleg Shpigun; Andrey Stavrianidi

doi:10.1016/j.jpba.2021.114382

Computer Science

Research output: Contribution to journal › Article › peer-review

Abstract

The combination of Liquid Chromatography and Mass Spectrometry (LC-MS) is commonly used to determine and characterize biologically active compounds because of its high resolution and sensitivity. In this work we explore the interpretation of LC-MS data using multivariate statistical analysis algorithms to extract useful chemical information and identify clusters of similar samples. Samples of leaves from 19 plants belonging to the Apiaceae family were analyzed under unified LC conditions by high- and low-resolution mass spectrometry in a wide range scan mode. LC-MS data preprocessing was performed, followed by statistical analysis using tensor decomposition in the form of Parallel Factor Analysis (PARAFAC); matrix factorization following tensor unfolding with principal component analysis (PCA), independent component analysis (ICA), or non-negative matrix factorization (NMF); or unsupervised feature selection (UFS). The optimal number of components for each of these methods was found and the results were compared using four different metrics: silhouette score, Davies-Bouldin index, computational time, and number of noisy components. It was found that PCA, ICA and UFS give the best results across the majority of the criteria for both low- and high-resolution data. An algorithm for biomarker signal selection is suggested and 23 potential chemotaxonomic markers were tentatively identified using MS² data. Dendrograms constructed by the methods were compared to the molecular phylogenetic tree by calculating the pixel-wise mean square error (MSE). The suggested approach can therefore support chemotaxonomic studies and yield valuable chemical information for biomarker discovery.
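The matrix-factorization route described in the abstract (unfold the tensor, decompose it, then score the resulting clusters) can be illustrated with a minimal sketch. This is not the authors' code: the data are synthetic, the function name `unfold_and_score` and the tensor layout (samples × retention time × m/z) are hypothetical, and agglomerative clustering is an assumed choice made only so that the two metrics named in the abstract have cluster labels to evaluate.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score, silhouette_score


def unfold_and_score(tensor, n_components=2, n_clusters=2):
    """Unfold a (samples, retention time, m/z) tensor, reduce it with PCA,
    cluster the PCA scores, and report two cluster-quality metrics.

    Hypothetical helper for illustration; not taken from the paper."""
    n_samples = tensor.shape[0]
    # Unfolding: each sample becomes one long row of intensities.
    matrix = tensor.reshape(n_samples, -1)
    scores = PCA(n_components=n_components).fit_transform(matrix)
    # Assumed clustering step (the abstract does not specify one here).
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(scores)
    return {
        "labels": labels,
        "silhouette": silhouette_score(scores, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(scores, labels),  # lower is better
    }


# Synthetic stand-in data: two groups of five "samples" with different peaks.
rng = np.random.default_rng(0)
tensor = rng.random((10, 20, 30)) * 0.01
tensor[:5, 3, 7] += 1.0    # group A: peak at retention bin 3, m/z bin 7
tensor[5:, 12, 21] += 1.0  # group B: peak at retention bin 12, m/z bin 21

result = unfold_and_score(tensor)
print(result["labels"], result["silhouette"], result["davies_bouldin"])
```

Swapping `PCA` for scikit-learn's `FastICA` or `NMF` would give the other two matrix factorizations mentioned in the abstract; PARAFAC operates on the tensor directly and would need a tensor library instead.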

Original language	English
Article number	114382
Number of pages	10
Journal	Journal of Pharmaceutical and Biomedical Analysis
Volume	206
Early online date	21 Sept 2021
DOIs	https://doi.org/10.1016/j.jpba.2021.114382
Publication status	Published - 30 Nov 2021

Keywords

Apiaceae
Liquid chromatography
Machine learning
Mass spectrometry
Multi-way data

Access to Document

10.1016/j.jpba.2021.114382 (Licence: None: All rights reserved)

TurovaP2021Unsupervised: Accepted author manuscript, 973 KB (Licence: Creative Commons: Attribution-NonCommercial-NoDerivs (CC BY-NC-ND))

Cite this

@article{faf66488c99f40afb3dddaedd28c0b4f,

title = "Unsupervised methods in LC-MS data treatment: application for potential chemotaxonomic markers search",

abstract = "The combination of Liquid Chromatography and Mass Spectrometry (LC-MS) is commonly used to determine and characterize biologically active compounds because of its high resolution and sensitivity. In this work we explore the interpretation of LC-MS data using multivariate statistical analysis algorithms to extract useful chemical information and identify clusters of similar samples. Samples of leaves from 19 plants belonging to the Apiaceae family were analyzed in unified LC conditions by high- and low-resolution mass spectrometry in a wide range scan mode. LC-MS data preprocessing was performed followed by statistical analysis using tensor decomposition in the form of Parallel Factor Analysis (PARAFAC); matrix factorization following tensor unfolding with principal component analysis (PCA), independent component analysis (ICA), non-negative matrix factorization (NMF); or unsupervised feature selection (UFS). The optimal number of components for each of these methods were found and results were compared using four different metrics: silhouette score, Davies-Bouldin index, computational time, number of noisy components. It was found that PCA, ICA and UFS give the best results across the majority of the criteria for both low- and high-resolution data. An algorithm for biomarker signal selection is suggested and 23 potential chemotaxonomic markers were tentatively identified using MS2 data. Dendrograms constructed by the methods were compared to the molecular phylogenic tree by calculating pixel-wise mean square error (MSE). Therefore, the suggested approach can support chemotaxonomic studies and yield valuable chemical information for biomarker discovery.",

keywords = "Apiaceae, Liquid chromatography, Machine learning, Mass spectrometry, Multi-way data",

author = "Polina Turova and Iain Styles and Vladimir Timashev and Konstantin Kravets and Alexander Grechnikov and Dmitry Lyskov and Tahir Samigullin and Ilya Podolskiy and Oleg Shpigun and Andrey Stavrianidi",

year = "2021",

month = nov,

day = "30",

doi = "10.1016/j.jpba.2021.114382",

language = "English",

volume = "206",

journal = "Journal of Pharmaceutical and Biomedical Analysis",

issn = "0731-7085",

publisher = "Elsevier",

}

TY - JOUR

T1 - Unsupervised methods in LC-MS data treatment

T2 - application for potential chemotaxonomic markers search

AU - Turova, Polina

AU - Styles, Iain

AU - Timashev, Vladimir

AU - Kravets, Konstantin

AU - Grechnikov, Alexander

AU - Lyskov, Dmitry

AU - Samigullin, Tahir

AU - Podolskiy, Ilya

AU - Shpigun, Oleg

AU - Stavrianidi, Andrey

PY - 2021/11/30

Y1 - 2021/11/30

N2 - The combination of Liquid Chromatography and Mass Spectrometry (LC-MS) is commonly used to determine and characterize biologically active compounds because of its high resolution and sensitivity. In this work we explore the interpretation of LC-MS data using multivariate statistical analysis algorithms to extract useful chemical information and identify clusters of similar samples. Samples of leaves from 19 plants belonging to the Apiaceae family were analyzed in unified LC conditions by high- and low-resolution mass spectrometry in a wide range scan mode. LC-MS data preprocessing was performed followed by statistical analysis using tensor decomposition in the form of Parallel Factor Analysis (PARAFAC); matrix factorization following tensor unfolding with principal component analysis (PCA), independent component analysis (ICA), non-negative matrix factorization (NMF); or unsupervised feature selection (UFS). The optimal number of components for each of these methods were found and results were compared using four different metrics: silhouette score, Davies-Bouldin index, computational time, number of noisy components. It was found that PCA, ICA and UFS give the best results across the majority of the criteria for both low- and high-resolution data. An algorithm for biomarker signal selection is suggested and 23 potential chemotaxonomic markers were tentatively identified using MS2 data. Dendrograms constructed by the methods were compared to the molecular phylogenic tree by calculating pixel-wise mean square error (MSE). Therefore, the suggested approach can support chemotaxonomic studies and yield valuable chemical information for biomarker discovery.

KW - Apiaceae

KW - Liquid chromatography

KW - Machine learning

KW - Mass spectrometry

KW - Multi-way data

UR - http://www.scopus.com/inward/record.url?scp=85116071218&partnerID=8YFLogxK

U2 - 10.1016/j.jpba.2021.114382

DO - 10.1016/j.jpba.2021.114382

M3 - Article

SN - 0731-7085

VL - 206

JO - Journal of Pharmaceutical and Biomedical Analysis

JF - Journal of Pharmaceutical and Biomedical Analysis

M1 - 114382

ER -
