The utility of different representations of protein sequence for predicting functional class

Ross D. King; Andreas Karwath; Amanda Clare; Luc Dehaspe

doi:10.1093/bioinformatics/17.5.445

Research output: Contribution to journal › Article › peer-review

63 Citations (Scopus)

Abstract

Motivation: Data Mining Prediction (DMP) is a novel approach to predicting protein functional class from sequence. DMP works even in the absence of a homologous protein of known function. We investigate the utility of different ways of representing protein sequence in DMP (residue frequencies, phylogeny, predicted structure) using the Escherichia coli genome as a model.

Results: With every type of representation, DMP learnt prediction rules that were more accurate than the default at every level of function. The most effective way to represent sequence was phylogeny (75% accuracy and 13% coverage of unassigned ORFs at the most general level of function; 69% accuracy and 7% coverage at the most detailed). We tested different methods for combining predictions from the different types of representation. These improved both the accuracy and the coverage of predictions: for example, 40% of all unassigned ORFs could be predicted at an estimated accuracy of 60%, and 5% of unassigned ORFs could be predicted at an estimated accuracy of 86%.

Availability: The rules and data are freely available. Warmr is free to academics.

Contact: rdk@aber.ac.uk

Original language	English
Pages (from-to)	445-454
Number of pages	10
Journal	Bioinformatics
Volume	17
Issue number	5
DOIs	https://doi.org/10.1093/bioinformatics/17.5.445
Publication status	Published - 2001

Keywords

bioinformatics, data mining, inductive logic programming, relational learning, scientific knowledge

Access to Document

10.1093/bioinformatics/17.5.445

https://bioinformatics.oxfordjournals.org/content/17/5/445

Cite this

@article{9c0dc56c4f1e4937921302ae3b061bf1,

title = "The utility of different representations of protein sequence for predicting functional class",

abstract = "Motivation: Data Mining Prediction (DMP) is a novel approach to predicting protein functional class from sequence. DMP works even in the absence of a homologous protein of known function. We investigate the utility of different ways of representing protein sequence in DMP (residue frequencies, phylogeny, predicted structure) using the Escherichia coli genome as a model. Results: Using the different representations DMP learnt prediction rules that were more accurate than default at every level of function using every type of representation. The most effective way to represent sequence was using phylogeny (75% accuracy and 13% coverage of unassigned ORFs at the most general level of function: 69% accuracy and 7% coverage at the most detailed). We tested different methods for combining predictions from the different types of representation. These improved both the accuracy and coverage of predictions, e.g. 40% of all unassigned ORFs could be predicted at an estimated accuracy of 60% and 5% of unassigned ORFs could be predicted at an estimated accuracy of 86%. Availability: The rules and data are freely available. Warmr is free to academics. Contact: rdk@aber.ac.uk",

keywords = "bioinformatics, data mining, inductive logic programming, relational learning, scientific knowledge",

author = "King, {Ross D.} and Andreas Karwath and Amanda Clare and Luc Dehaspe",

year = "2001",

doi = "10.1093/bioinformatics/17.5.445",

language = "English",

volume = "17",

pages = "445--454",

journal = "Bioinformatics",

issn = "1367-4803",

publisher = "Oxford University Press",

number = "5",

}

TY - JOUR

T1 - The utility of different representations of protein sequence for predicting functional class

AU - King, Ross D.

AU - Karwath, Andreas

AU - Clare, Amanda

AU - Dehaspe, Luc

PY - 2001

Y1 - 2001

N2 - Motivation: Data Mining Prediction (DMP) is a novel approach to predicting protein functional class from sequence. DMP works even in the absence of a homologous protein of known function. We investigate the utility of different ways of representing protein sequence in DMP (residue frequencies, phylogeny, predicted structure) using the Escherichia coli genome as a model. Results: Using the different representations DMP learnt prediction rules that were more accurate than default at every level of function using every type of representation. The most effective way to represent sequence was using phylogeny (75% accuracy and 13% coverage of unassigned ORFs at the most general level of function: 69% accuracy and 7% coverage at the most detailed). We tested different methods for combining predictions from the different types of representation. These improved both the accuracy and coverage of predictions, e.g. 40% of all unassigned ORFs could be predicted at an estimated accuracy of 60% and 5% of unassigned ORFs could be predicted at an estimated accuracy of 86%. Availability: The rules and data are freely available. Warmr is free to academics. Contact: rdk@aber.ac.uk

AB - Motivation: Data Mining Prediction (DMP) is a novel approach to predicting protein functional class from sequence. DMP works even in the absence of a homologous protein of known function. We investigate the utility of different ways of representing protein sequence in DMP (residue frequencies, phylogeny, predicted structure) using the Escherichia coli genome as a model. Results: Using the different representations DMP learnt prediction rules that were more accurate than default at every level of function using every type of representation. The most effective way to represent sequence was using phylogeny (75% accuracy and 13% coverage of unassigned ORFs at the most general level of function: 69% accuracy and 7% coverage at the most detailed). We tested different methods for combining predictions from the different types of representation. These improved both the accuracy and coverage of predictions, e.g. 40% of all unassigned ORFs could be predicted at an estimated accuracy of 60% and 5% of unassigned ORFs could be predicted at an estimated accuracy of 86%. Availability: The rules and data are freely available. Warmr is free to academics. Contact: rdk@aber.ac.uk

KW - bioinformatics

KW - data mining

KW - inductive logic programming

KW - relational learning

KW - scientific knowledge

U2 - 10.1093/bioinformatics/17.5.445

DO - 10.1093/bioinformatics/17.5.445

M3 - Article

SN - 1367-4803

VL - 17

SP - 445

EP - 454

JO - Bioinformatics

JF - Bioinformatics

IS - 5

ER -
