Accurate prediction of protein functional class from sequence in the Mycobacterium tuberculosis and Escherichia coli genomes using data mining.

Ross D. King; Andreas Karwath; Amanda Clare; Luc Dehaspe

doi:10.1002/1097-0061(200012)17:4<283::AID-YEA52>3.0.CO;2-F

Research output: Contribution to journal › Article › peer-review

50 Citations (Scopus)

Abstract

The analysis of genomics data needs to become as automated as its generation. Here we present a novel data-mining approach to predicting protein functional class from sequence. This method is based on a combination of inductive logic programming clustering and rule learning. We demonstrate the effectiveness of this approach on the M. tuberculosis and E. coli genomes, and identify biologically interpretable rules which predict protein functional class from information only available from the sequence. These rules predict 65% of the ORFs with no assigned function in M. tuberculosis and 24% of those in E. coli, with an estimated accuracy of 60–80% (depending on the level of functional assignment). The rules are founded on a combination of detection of remote homology, convergent evolution and horizontal gene transfer. We identify rules that predict protein functional class even in the absence of detectable sequence or structural homology. These rules give insight into the evolutionary history of M. tuberculosis and E. coli.

Original language	English
Pages (from-to)	283-293
Number of pages	11
Journal	Yeast
Volume	17
DOIs	https://doi.org/10.1002/1097-0061(200012)17:4<283::AID-YEA52>3.0.CO;2-F
Publication status	Published - 2000

Keywords

bioinformatics, data mining, inductive logic programming, relational learning, scientific knowledge

Access to Document

10.1002/1097-0061(200012)17:4<283::AID-YEA52>3.0.CO;2-F

http://onlinelibrary.wiley.com/doi/10.1002/1097-0061(200012)17:4%3C283::AID-YEA52%3E3.0.CO;2-F/abstract

Cite this

@article{a22ba3a3f2f84563832a5e50b5d01fbc,
  title = "Accurate prediction of protein functional class from sequence in the Mycobacterium tuberculosis and Escherichia coli genomes using data mining.",
  abstract = "The analysis of genomics data needs to become as automated as its generation. Here we present a novel data-mining approach to predicting protein functional class from sequence. This method is based on a combination of inductive logic programming clustering and rule learning. We demonstrate the effectiveness of this approach on the M. tuberculosis and E. coli genomes, and identify biologically interpretable rules which predict protein functional class from information only available from the sequence. These rules predict 65% of the ORFs with no assigned function in M. tuberculosis and 24% of those in E. coli, with an estimated accuracy of 60–80% (depending on the level of functional assignment). The rules are founded on a combination of detection of remote homology, convergent evolution and horizontal gene transfer. We identify rules that predict protein functional class even in the absence of detectable sequence or structural homology. These rules give insight into the evolutionary history of M. tuberculosis and E. coli.",
  keywords = "bioinformatics, data mining, inductive logic programming, relational learning, scientific knowledge",
  author = "King, {Ross D.} and Andreas Karwath and Amanda Clare and Luc Dehaspe",
  year = "2000",
  doi = "10.1002/1097-0061(200012)17:4<283::AID-YEA52>3.0.CO;2-F",
  language = "English",
  volume = "17",
  pages = "283--293",
  journal = "Yeast",
  issn = "0749-503X",
  publisher = "Wiley",
}

TY  - JOUR
T1  - Accurate prediction of protein functional class from sequence in the Mycobacterium tuberculosis and Escherichia coli genomes using data mining.
AU  - King, Ross D.
AU  - Karwath, Andreas
AU  - Clare, Amanda
AU  - Dehaspe, Luc
PY  - 2000
Y1  - 2000
N2  - The analysis of genomics data needs to become as automated as its generation. Here we present a novel data-mining approach to predicting protein functional class from sequence. This method is based on a combination of inductive logic programming clustering and rule learning. We demonstrate the effectiveness of this approach on the M. tuberculosis and E. coli genomes, and identify biologically interpretable rules which predict protein functional class from information only available from the sequence. These rules predict 65% of the ORFs with no assigned function in M. tuberculosis and 24% of those in E. coli, with an estimated accuracy of 60–80% (depending on the level of functional assignment). The rules are founded on a combination of detection of remote homology, convergent evolution and horizontal gene transfer. We identify rules that predict protein functional class even in the absence of detectable sequence or structural homology. These rules give insight into the evolutionary history of M. tuberculosis and E. coli.
AB  - The analysis of genomics data needs to become as automated as its generation. Here we present a novel data-mining approach to predicting protein functional class from sequence. This method is based on a combination of inductive logic programming clustering and rule learning. We demonstrate the effectiveness of this approach on the M. tuberculosis and E. coli genomes, and identify biologically interpretable rules which predict protein functional class from information only available from the sequence. These rules predict 65% of the ORFs with no assigned function in M. tuberculosis and 24% of those in E. coli, with an estimated accuracy of 60–80% (depending on the level of functional assignment). The rules are founded on a combination of detection of remote homology, convergent evolution and horizontal gene transfer. We identify rules that predict protein functional class even in the absence of detectable sequence or structural homology. These rules give insight into the evolutionary history of M. tuberculosis and E. coli.
KW  - bioinformatics
KW  - data mining
KW  - inductive logic programming
KW  - relational learning
KW  - scientific knowledge
U2  - 10.1002/1097-0061(200012)17:4<283::AID-YEA52>3.0.CO;2-F
DO  - 10.1002/1097-0061(200012)17:4<283::AID-YEA52>3.0.CO;2-F
M3  - Article
SN  - 0749-503X
VL  - 17
SP  - 283
EP  - 293
JO  - Yeast
JF  - Yeast
ER  -
