Nanopore sequencing and assembly of a human genome with ultra-long reads

Research output: Contribution to journal › Article › peer-review

Authors

  • Miten Jain
  • Sergey Koren
  • Karen H Miga
  • Arthur C Rand
  • Thomas A Sasani
  • John R Tyson
  • Alexander T Dilthey
  • Ian T Fiddes
  • Sunir Malla
  • Hannah Marriott
  • Tom Nieto
  • Justin O'Grady
  • Hugh E Olsen
  • Brent S Pedersen
  • Arang Rhie
  • Hollian Richardson
  • Aaron R Quinlan
  • Terrance P Snutch
  • Louise Tee
  • Benedict Paten
  • Adam M Phillippy
  • Jared T Simpson
  • Matthew Loose

Colleges, School and Institutes

External organisations

  • UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California, USA.
  • Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, Maryland, USA.
  • USTAR Center for Genetic Discovery, University of Utah, Salt Lake City, Utah, USA.
  • Michael Smith Laboratories and Djavad Mowafaghian Centre for Brain Health, University of British Columbia, Vancouver, Canada.
  • DeepSeq, School of Life Sciences, University of Nottingham, Nottingham, UK.
  • University of East Anglia
  • Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA.
  • Department of Computer Science, University of Toronto, Toronto, Canada.

Abstract

We report the sequencing and assembly of a reference genome for the human GM12878 Utah/Ceph cell line using the MinION (Oxford Nanopore Technologies) nanopore sequencer. 91.2 Gb of sequence data, representing ∼30× theoretical coverage, were produced. Reference-based alignment enabled detection of large structural variants and epigenetic modifications. De novo assembly of nanopore reads alone yielded a contiguous assembly (NG50 ∼3 Mb). We developed a protocol to generate ultra-long reads (N50 > 100 kb, read lengths up to 882 kb). Incorporating an additional 5× coverage of these ultra-long reads more than doubled the assembly contiguity (NG50 ∼6.4 Mb). The final assembled genome was 2,867 million bases in size, covering 85.8% of the reference. Assembly accuracy, after incorporating complementary short-read sequencing data, exceeded 99.8%. Ultra-long reads enabled assembly and phasing of the 4-Mb major histocompatibility complex (MHC) locus in its entirety, measurement of telomere repeat length, and closure of gaps in the reference human genome assembly GRCh38.
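The contiguity and coverage figures quoted in the abstract follow standard definitions, which can be sketched in a few lines. N50 is the length L such that sequences of length L or longer make up at least half of the total assembled bases; NG50 uses half of an assumed genome size as the target instead. The helper names (`nx50`, `theoretical_coverage`), the toy contig lengths, and the 3.1 Gb genome size below are illustrative assumptions, not values or code from the paper.

```python
def nx50(lengths, total=None):
    """Return the length L such that sequences of length >= L sum to at
    least half of `total`.

    With total=None the target is half the summed lengths (N50).
    Passing an assumed genome size as `total` gives NG50.
    """
    target = (total if total is not None else sum(lengths)) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0  # the sequences cover less than half of `total`


def theoretical_coverage(total_bases, genome_size):
    """Average depth if the bases were spread evenly over the genome."""
    return total_bases / genome_size


# Toy contig lengths (hypothetical, not from the paper).
contigs = [80, 70, 50, 40, 30, 20]
n50 = nx50(contigs)               # target is half of 290 assembled bases
ng50 = nx50(contigs, total=400)   # target is half of an assumed 400-base genome

# 91.2 Gb of data over an ASSUMED 3.1 Gb genome; the abstract reports ~30x.
depth = theoretical_coverage(91.2e9, 3.1e9)
```

Because NG50 is tied to a fixed genome size rather than to the assembly's own total, it is comparable across assemblies of the same genome, which is presumably why the abstract reports NG50 for the assemblies and N50 for the reads.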

Details

Original language: English
Pages (from-to): 338–345
Journal: Nature Biotechnology
Volume: 36
Issue number: 4
Early online date: 29 Jan 2018
Publication status: Published - Apr 2018