Nanopore sequencing and assembly of a human genome with ultra-long reads

Research output: Contribution to journal › Article

Authors

  • Miten Jain
  • Sergey Koren
  • Karen H Miga
  • Arthur C Rand
  • Thomas A Sasani
  • John R Tyson
  • Alexander T Dilthey
  • Ian T Fiddes
  • Sunir Malla
  • Hannah Marriott
  • Tom Nieto
  • Justin O'Grady
  • Hugh E Olsen
  • Brent S Pedersen
  • Arang Rhie
  • Hollian Richardson
  • Aaron R Quinlan
  • Terrance P Snutch
  • Louise Tee
  • Benedict Paten
  • Adam M Phillippy
  • Jared T Simpson
  • Matthew Loose

External organisations

  • UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California, USA.
  • Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, Maryland, USA.
  • USTAR Center for Genetic Discovery, University of Utah, Salt Lake City, Utah, USA.
  • Michael Smith Laboratories and Djavad Mowafaghian Centre for Brain Health, University of British Columbia, Vancouver, Canada.
  • Surgical Research Laboratory, Institute of Cancer & Genomic Science, University of Birmingham, UK.
  • DeepSeq, School of Life Sciences, University of Nottingham, Nottingham, UK.
  • University of East Anglia
  • Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA.
  • Department of Computer Science, University of Toronto, Toronto, Canada.

Abstract

We report the sequencing and assembly of a reference genome for the human GM12878 Utah/Ceph cell line using the MinION (Oxford Nanopore Technologies) nanopore sequencer. 91.2 Gb of sequence data, representing ∼30× theoretical coverage, were produced. Reference-based alignment enabled detection of large structural variants and epigenetic modifications. De novo assembly of nanopore reads alone yielded a contiguous assembly (NG50 ∼3 Mb). We developed a protocol to generate ultra-long reads (N50 > 100 kb, read lengths up to 882 kb). Incorporating an additional 5× coverage of these ultra-long reads more than doubled the assembly contiguity (NG50 ∼6.4 Mb). The final assembled genome was 2,867 million bases in size, covering 85.8% of the reference. Assembly accuracy, after incorporating complementary short-read sequencing data, exceeded 99.8%. Ultra-long reads enabled assembly and phasing of the 4-Mb major histocompatibility complex (MHC) locus in its entirety, measurement of telomere repeat length, and closure of gaps in the reference human genome assembly GRCh38.
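The coverage and contiguity figures quoted in the abstract follow standard definitions, which can be sketched as below. This is a minimal illustration and not the authors' pipeline: the function names are hypothetical, and the 3.1 Gb genome size is an assumed round figure for the human genome, chosen only to show how 91.2 Gb of data corresponds to roughly 30× theoretical coverage.

```python
# Minimal sketch of the summary statistics mentioned in the abstract.
# All function names are hypothetical; the genome size is an assumption.

ASSUMED_GENOME_SIZE = 3.1e9  # bases; assumed approximate human genome size


def theoretical_coverage(total_bases, genome_size):
    """Average depth if the sequenced bases were spread evenly over the genome."""
    return total_bases / genome_size


def nx50(lengths, target_total):
    """Length L such that sequences of length >= L sum to at least half of target_total.

    Returns 0 if the sequences together cover less than half of target_total.
    """
    half = target_total / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def n50(lengths):
    """N50: half of the total sequenced or assembled bases is the target."""
    return nx50(lengths, sum(lengths))


def ng50(lengths, genome_size):
    """NG50: half of the (estimated) genome size is the target."""
    return nx50(lengths, genome_size)


if __name__ == "__main__":
    # 91.2 Gb of data over an assumed 3.1 Gb genome
    print(round(theoretical_coverage(91.2e9, ASSUMED_GENOME_SIZE), 1))
    # Toy contig lengths, for illustration only
    toy_contigs = [2, 3, 4, 5, 6]
    print(n50(toy_contigs), ng50(toy_contigs, 30))
```

N50 is normalised by the total length of the reads or contigs themselves, whereas NG50 is normalised by the genome size, so NG50 values remain comparable between assemblies of different total length.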

Details

Original language: English
Journal: Nature Biotechnology
Early online date: 29 Jan 2018
Publication status: E-pub ahead of print - 29 Jan 2018