Robust twin boosting for feature selection from high-dimensional omics data with label noise

Shan He; Huanhuan Chen; Zexuan Zhu; Douglas G. Ward; Helen J. Cooper; Mark R. Viant; John K. Heath; Xin Yao

doi:10.1016/j.ins.2014.08.048

Shan He, Huanhuan Chen, Zexuan Zhu*, Douglas G. Ward, Helen J. Cooper, Mark R. Viant, John K. Heath, Xin Yao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

23 Citations (Scopus)

729 Downloads (Pure)

Abstract

Omics data such as microarray transcriptomic and mass spectrometry proteomic data are typically characterized by high dimensionality and relatively small sample sizes. In order to discover biomarkers for diagnosis and prognosis from omics data, feature selection has become an indispensable step to find a parsimonious set of informative features. However, many previous studies report considerable label noise in omics data, which will lead to unreliable inferences to select uninformative features. Yet, to the best of our knowledge, very few feature selection methods are proposed to address this problem. This paper proposes a novel ensemble feature selection algorithm, robust twin boosting feature selection (RTBFS), which is robust to label noise in omics data. The algorithm has been validated on an omics feature selection test bed and seven real-world heterogeneous omics datasets, of which some are known to have label noise. Compared with several state-of-the-art ensemble feature selection methods, RTBFS can select more informative features despite label noise and obtain better classification results. RTBFS is a general feature selection method and can be applied to other data with label noise. MATLAB implementation of RTBFS and sample datasets are available at: http://www.cs.bham.ac.uk/~szh/TReBFSMatlab.zip.

Original language	English
Pages (from-to)	1-18
Number of pages	18
Journal	Information Sciences
Volume	291
Early online date	30 Aug 2014
DOIs	https://doi.org/10.1016/j.ins.2014.08.048
Publication status	Published - 1 Jan 2015

Keywords

Boosting
Ensemble learning
Feature selection

ASJC Scopus subject areas

Artificial Intelligence
Software
Control and Systems Engineering
Theoretical Computer Science
Computer Science Applications
Information Systems and Management

Access to Document

10.1016/j.ins.2014.08.048

He_Robust_twin_boosting_Information_Sciences_2014
NOTICE: this is the author’s version of a work that was accepted for publication in Information Sciences. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Information Sciences [VOL 291, January 2015] DOI: 10.1016/j.ins.2014.08.048
Accepted author manuscript, 450 KB
Licence: Other (please specify with Rights Statement)

Cite this

@article{63c4eee5b2b6473687a4ab94f5b52972,

title = "Robust twin boosting for feature selection from high-dimensional omics data with label noise",

abstract = "Omics data such as microarray transcriptomic and mass spectrometry proteomic data are typically characterized by high dimensionality and relatively small sample sizes. In order to discover biomarkers for diagnosis and prognosis from omics data, feature selection has become an indispensable step to find a parsimonious set of informative features. However, many previous studies report considerable label noise in omics data, which will lead to unreliable inferences to select uninformative features. Yet, to the best of our knowledge, very few feature selection methods are proposed to address this problem. This paper proposes a novel ensemble feature selection algorithm, robust twin boosting feature selection (RTBFS), which is robust to label noise in omics data. The algorithm has been validated on an omics feature selection test bed and seven real-world heterogeneous omics datasets, of which some are known to have label noise. Compared with several state-of-the-art ensemble feature selection methods, RTBFS can select more informative features despite label noise and obtain better classification results. RTBFS is a general feature selection method and can be applied to other data with label noise. MATLAB implementation of RTBFS and sample datasets are available at: http://www.cs.bham.ac.uk/~szh/TReBFSMatlab.zip.",

keywords = "Boosting, Ensemble learning, Feature selection",

author = "Shan He and Huanhuan Chen and Zexuan Zhu and Ward, {Douglas G.} and Cooper, {Helen J.} and Viant, {Mark R.} and Heath, {John K.} and Xin Yao",

year = "2015",

month = jan,

day = "1",

doi = "10.1016/j.ins.2014.08.048",

language = "English",

volume = "291",

pages = "1--18",

journal = "Information Sciences",

issn = "0020-0255",

publisher = "Elsevier",

}

TY - JOUR

T1 - Robust twin boosting for feature selection from high-dimensional omics data with label noise

AU - He, Shan

AU - Chen, Huanhuan

AU - Zhu, Zexuan

AU - Ward, Douglas G.

AU - Cooper, Helen J.

AU - Viant, Mark R.

AU - Heath, John K.

AU - Yao, Xin

PY - 2015/1/1

Y1 - 2015/1/1

N2 - Omics data such as microarray transcriptomic and mass spectrometry proteomic data are typically characterized by high dimensionality and relatively small sample sizes. In order to discover biomarkers for diagnosis and prognosis from omics data, feature selection has become an indispensable step to find a parsimonious set of informative features. However, many previous studies report considerable label noise in omics data, which will lead to unreliable inferences to select uninformative features. Yet, to the best of our knowledge, very few feature selection methods are proposed to address this problem. This paper proposes a novel ensemble feature selection algorithm, robust twin boosting feature selection (RTBFS), which is robust to label noise in omics data. The algorithm has been validated on an omics feature selection test bed and seven real-world heterogeneous omics datasets, of which some are known to have label noise. Compared with several state-of-the-art ensemble feature selection methods, RTBFS can select more informative features despite label noise and obtain better classification results. RTBFS is a general feature selection method and can be applied to other data with label noise. MATLAB implementation of RTBFS and sample datasets are available at: http://www.cs.bham.ac.uk/~szh/TReBFSMatlab.zip.

AB - Omics data such as microarray transcriptomic and mass spectrometry proteomic data are typically characterized by high dimensionality and relatively small sample sizes. In order to discover biomarkers for diagnosis and prognosis from omics data, feature selection has become an indispensable step to find a parsimonious set of informative features. However, many previous studies report considerable label noise in omics data, which will lead to unreliable inferences to select uninformative features. Yet, to the best of our knowledge, very few feature selection methods are proposed to address this problem. This paper proposes a novel ensemble feature selection algorithm, robust twin boosting feature selection (RTBFS), which is robust to label noise in omics data. The algorithm has been validated on an omics feature selection test bed and seven real-world heterogeneous omics datasets, of which some are known to have label noise. Compared with several state-of-the-art ensemble feature selection methods, RTBFS can select more informative features despite label noise and obtain better classification results. RTBFS is a general feature selection method and can be applied to other data with label noise. MATLAB implementation of RTBFS and sample datasets are available at: http://www.cs.bham.ac.uk/~szh/TReBFSMatlab.zip.

KW - Boosting

KW - Ensemble learning

KW - Feature selection

UR - http://www.scopus.com/inward/record.url?scp=84923329397&partnerID=8YFLogxK

U2 - 10.1016/j.ins.2014.08.048

DO - 10.1016/j.ins.2014.08.048

M3 - Article

SN - 0020-0255

VL - 291

SP - 1

EP - 18

JO - Information Sciences

JF - Information Sciences

ER -
