DivA: detection of non-homologous and very divergent regions in protein sequence alignments

Marie Zepeda Mendoza; Sanne Nygaard; Rute R da Fonseca

doi:10.1186/1756-0500-7-806

Microbiology and Infection

Research output: Contribution to journal › Article › peer-review

Abstract

BACKGROUND: Sequence alignments are used to find evidence of homology but sometimes contain regions that are difficult to align, which can interfere with the quality of subsequent analyses. Although it is possible to remove problematic regions manually, this is impractical in large genome-scale studies, and the results suffer from irreproducibility arising from subjectivity. Some automated alignment trimming methods have been developed to remove problematic regions, but these mostly act by removing complete columns or complete sequences from the multiple sequence alignment (MSA), discarding many informative sites.

FINDINGS: Here we present a tool that identifies Divergent windows in protein sequence Alignments (DivA). DivA makes no assumptions about evolutionary models, and it is ideal for detecting incorrectly annotated segments within individual gene sequences. DivA uses a sliding-window approach to estimate four divergence-based parameters and their outlier values. It then classifies a window of a sequence in an alignment as very divergent (potentially non-homologous) if it presents a combination of outlier values for the four parameters. The windows classified as very divergent can optionally be masked in the alignment.

CONCLUSIONS: DivA automatically identifies very divergent and incorrectly annotated genic regions in MSAs, avoiding the subjective and time-consuming process of manual annotation. The output is easy to interpret and allows the user to make more informed decisions, reducing the amount of sequence discarded while still finding the potentially erroneous and non-homologous regions.
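
The sliding-window idea described under FINDINGS can be illustrated with a minimal Python sketch. This is not DivA's implementation: the abstract does not name the four divergence parameters, so the sketch substitutes a single hypothetical score (the mean fraction of mismatched columns against the other sequences) and assumes Tukey's interquartile-range fence as the outlier rule. The function names, the window size, the step, and the mask character are all illustrative assumptions.

```python
from statistics import quantiles


def window_divergence(alignment, seq_index, start, size):
    """Mean fraction of mismatched columns between one sequence and
    every other sequence, within one window (hypothetical stand-in
    for DivA's four parameters)."""
    target = alignment[seq_index][start:start + size]
    fractions = []
    for j, other in enumerate(alignment):
        if j == seq_index:
            continue
        segment = other[start:start + size]
        mismatches = sum(a != b for a, b in zip(target, segment))
        fractions.append(mismatches / len(target))
    return sum(fractions) / len(fractions)


def find_divergent_windows(alignment, size=5, step=5, k=1.5):
    """Return (sequence index, window start) pairs whose score lies
    above the upper Tukey fence Q3 + k * IQR (assumed outlier rule)."""
    length = len(alignment[0])
    scores = {}
    for start in range(0, length - size + 1, step):
        for i in range(len(alignment)):
            scores[(i, start)] = window_divergence(alignment, i, start, size)
    q1, _, q3 = quantiles(scores.values(), n=4)
    upper = q3 + k * (q3 - q1)
    return sorted(key for key, score in scores.items() if score > upper)


def mask_windows(alignment, flagged, size=5, mask_char="X"):
    """Replace residues (not gaps) in flagged windows with mask_char."""
    rows = [list(seq) for seq in alignment]
    for i, start in flagged:
        for col in range(start, start + size):
            if rows[i][col] != "-":
                rows[i][col] = mask_char
    return ["".join(row) for row in rows]


if __name__ == "__main__":
    # Toy alignment: the third sequence carries a divergent middle window.
    alignment = [
        "MKTAYIAKQRQISFV",
        "MKTAYIAKQRQISFV",
        "MKTAYWWWWWQISFV",
        "MKTAYIAKQRQISFV",
    ]
    flagged = find_divergent_windows(alignment)
    for row in mask_windows(alignment, flagged):
        print(row)
```

Masking only the flagged window of a single sequence, rather than dropping a whole column or a whole sequence, is the property the abstract contrasts with column-based trimming tools. A real analysis would need several complementary scores, as DivA is described as combining four.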

Original language	English
Article number	806
Number of pages	6
Journal	BMC Research Notes
Volume	7
DOIs	https://doi.org/10.1186/1756-0500-7-806
Publication status	Published - 18 Nov 2014

Keywords

Amino Acid Sequence
Animals
Databases, Protein
Humans
Molecular Sequence Data
Proteins/chemistry
Sequence Alignment
Sequence Homology, Amino Acid
Software

Access to Document

10.1186/1756-0500-7-806
Final published version
Licence: Creative Commons: Attribution (CC BY)

Cite this

@article{86e853ab37524cb3a655763e0cd4fecc,

title = "DivA: detection of non-homologous and very divergent regions in protein sequence alignments",

abstract = "BACKGROUND: Sequence alignments are used to find evidence of homology but sometimes contain regions that are difficult to align, which can interfere with the quality of subsequent analyses. Although it is possible to remove problematic regions manually, this is impractical in large genome-scale studies, and the results suffer from irreproducibility arising from subjectivity. Some automated alignment trimming methods have been developed to remove problematic regions, but these mostly act by removing complete columns or complete sequences from the multiple sequence alignment (MSA), discarding many informative sites. FINDINGS: Here we present a tool that identifies Divergent windows in protein sequence Alignments (DivA). DivA makes no assumptions about evolutionary models, and it is ideal for detecting incorrectly annotated segments within individual gene sequences. DivA uses a sliding-window approach to estimate four divergence-based parameters and their outlier values. It then classifies a window of a sequence in an alignment as very divergent (potentially non-homologous) if it presents a combination of outlier values for the four parameters. The windows classified as very divergent can optionally be masked in the alignment. CONCLUSIONS: DivA automatically identifies very divergent and incorrectly annotated genic regions in MSAs, avoiding the subjective and time-consuming process of manual annotation. The output is easy to interpret and allows the user to make more informed decisions, reducing the amount of sequence discarded while still finding the potentially erroneous and non-homologous regions.",

keywords = "Amino Acid Sequence; Animals; Databases, Protein; Humans; Molecular Sequence Data; Proteins/chemistry; Sequence Alignment; Sequence Homology, Amino Acid; Software",

author = "{Zepeda Mendoza}, Marie and Nygaard, Sanne and {da Fonseca}, {Rute R}",

year = "2014",

month = nov,

day = "18",

doi = "10.1186/1756-0500-7-806",

language = "English",

volume = "7",

journal = "BMC Research Notes",

issn = "1756-0500",

publisher = "Springer",

}

TY - JOUR

T1 - DivA: detection of non-homologous and very divergent regions in protein sequence alignments

AU - Zepeda Mendoza, Marie

AU - Nygaard, Sanne

AU - da Fonseca, Rute R

PY - 2014/11/18

Y1 - 2014/11/18

AB - BACKGROUND: Sequence alignments are used to find evidence of homology but sometimes contain regions that are difficult to align, which can interfere with the quality of subsequent analyses. Although it is possible to remove problematic regions manually, this is impractical in large genome-scale studies, and the results suffer from irreproducibility arising from subjectivity. Some automated alignment trimming methods have been developed to remove problematic regions, but these mostly act by removing complete columns or complete sequences from the multiple sequence alignment (MSA), discarding many informative sites. FINDINGS: Here we present a tool that identifies Divergent windows in protein sequence Alignments (DivA). DivA makes no assumptions about evolutionary models, and it is ideal for detecting incorrectly annotated segments within individual gene sequences. DivA uses a sliding-window approach to estimate four divergence-based parameters and their outlier values. It then classifies a window of a sequence in an alignment as very divergent (potentially non-homologous) if it presents a combination of outlier values for the four parameters. The windows classified as very divergent can optionally be masked in the alignment. CONCLUSIONS: DivA automatically identifies very divergent and incorrectly annotated genic regions in MSAs, avoiding the subjective and time-consuming process of manual annotation. The output is easy to interpret and allows the user to make more informed decisions, reducing the amount of sequence discarded while still finding the potentially erroneous and non-homologous regions.

KW - Amino Acid Sequence

KW - Animals

KW - Databases, Protein

KW - Humans

KW - Molecular Sequence Data

KW - Proteins/chemistry

KW - Sequence Alignment

KW - Sequence Homology, Amino Acid

KW - Software

U2 - 10.1186/1756-0500-7-806

DO - 10.1186/1756-0500-7-806

M3 - Article

C2 - 25403086

SN - 1756-0500

VL - 7

JO - BMC Research Notes

JF - BMC Research Notes

M1 - 806

ER -
