RepLong - de novo repeat identification using long read sequencing data

Research output: Contribution to journal › Article › peer-review

Authors

  • Rui Guo
  • Yan-Ran Li
  • Le Ou-Yang
  • Yiwen Sun
  • Zexuan Zhu

Colleges, School and Institutes

External organisations

  • Shenzhen University, Shenzhen, China

Abstract

Motivation: The identification of repetitive elements is important in genome assembly and phylogenetic analyses. Existing de novo repeat identification methods that rely on short reads are ineffective at identifying long repeats. Because long reads are more likely to cover repeat regions completely, they are better suited to recognizing long repeats.
Summary: In this study, we propose RepLong, a novel de novo repeat element identification method based on PacBio long reads. Because reads mapped to repeat regions overlap heavily with one another, identifying repeat elements is equivalent to discovering consensus overlaps between reads, which can in turn be cast as a community detection problem in the network of read overlaps. RepLong first constructs a network of read overlaps from pairwise alignments of the reads, in which each vertex represents a read and each edge represents a substantial overlap between the two corresponding reads. Second, communities whose intra-community connectivity is greater than their inter-community connectivity are extracted by network modularity optimization. Finally, representative reads from each community are extracted to form the repeat library. Comparison studies against genome-based and short-read-based methods on Drosophila melanogaster and human long-read sequencing data demonstrate the efficiency of RepLong in identifying long repeats. RepLong can handle lower-coverage data and can serve as a complement to existing methods, improving repeat identification performance on long-read sequencing data.
Availability: RepLong is freely available at https://github.com/ruiguo-bio/replong
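
The three steps outlined in the Summary (overlap network, modularity-based community extraction, representative reads) can be illustrated with a small, self-contained Python sketch. It is not the RepLong implementation. It assumes pairwise overlaps have already been computed by an external aligner and are given as (read_a, read_b, overlap_length) tuples. It uses a simple greedy agglomerative scheme as one possible way to optimize modularity. It picks as representative the read with the largest total overlap inside its community, which is an assumption made here for illustration. All function names, the `min_overlap` and `min_size` thresholds, and the toy data are hypothetical.

```python
from collections import defaultdict


def build_overlap_graph(overlaps, min_overlap=500):
    """Vertices are reads; an edge joins two reads with a substantial overlap.

    `overlaps` holds (read_a, read_b, overlap_length) tuples, assumed to come
    from an external pairwise aligner. `min_overlap` is a hypothetical cutoff.
    """
    graph = defaultdict(dict)
    for a, b, length in overlaps:
        if a != b and length >= min_overlap:
            graph[a][b] = length
            graph[b][a] = length
    return dict(graph)


def greedy_modularity(graph):
    """Repeatedly merge the two communities with the largest positive
    modularity gain (unweighted edges); stop when no merge improves it."""
    m = sum(len(nbrs) for nbrs in graph.values()) / 2  # number of edges
    members = {v: {v} for v in graph}            # community label -> reads
    label = {v: v for v in graph}                # read -> community label
    degree = {v: len(graph[v]) for v in graph}   # community label -> degree sum
    while m > 0:
        # Count the edges running between each pair of distinct communities.
        links = defaultdict(int)
        for u, nbrs in graph.items():
            for v in nbrs:
                cu, cv = label[u], label[v]
                if cu < cv:  # each undirected edge is counted exactly once
                    links[(cu, cv)] += 1
        if not links:
            break

        def gain(pair):
            # Modularity change when merging i and j: L_ij/m - d_i*d_j/(2*m^2)
            i, j = pair
            return links[pair] / m - degree[i] * degree[j] / (2 * m * m)

        best = max(sorted(links), key=gain)
        if gain(best) <= 1e-12:
            break
        keep, drop = best
        for v in members[drop]:
            label[v] = keep
        members[keep] |= members.pop(drop)
        degree[keep] += degree.pop(drop)
    return list(members.values())


def representative_reads(graph, communities, min_size=3):
    """Pick one read per community: the one with the largest total overlap
    with other members (an illustrative rule, not taken from the paper)."""
    library = []
    for community in communities:
        if len(community) < min_size:
            continue  # small groups are unlikely to reflect a repeat

        def total_overlap(read):
            return sum(n for nbr, n in graph[read].items() if nbr in community)

        library.append(max(sorted(community), key=total_overlap))
    return sorted(library)


# Toy data: two groups of mutually overlapping reads joined by one weak link.
overlaps = [
    ("r1", "r2", 1500), ("r2", "r3", 1500), ("r2", "r4", 1500),
    ("r1", "r3", 1000), ("r1", "r4", 1000), ("r3", "r4", 1000),
    ("r5", "r7", 1200), ("r6", "r7", 1200), ("r7", "r8", 1200),
    ("r5", "r6", 1000), ("r5", "r8", 1000), ("r6", "r8", 1000),
    ("r4", "r5", 900),   # single bridge between the two groups
    ("r1", "r8", 100),   # too short to count as a substantial overlap
]
graph = build_overlap_graph(overlaps)
communities = greedy_modularity(graph)
print(representative_reads(graph, communities))
```

On real data the overlap network is far larger and noisier, so recomputing all inter-community edge counts on every merge, as this sketch does, would be too slow; a production tool would use an incremental or multilevel modularity method instead.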

Details

Original language: English
Article number: btx717
Number of pages: 9
Journal: Bioinformatics
Early online date: 6 Nov 2017
Publication status: E-pub ahead of print - 6 Nov 2017