Design and evaluation of a genomics variant analysis pipeline using GATK Spark tools

Nicholas Tucci, Jacek Cala, Jannetta Steyn, Paolo Missier

Research output: Contribution to journal › Conference article › peer-review

Abstract

Scalable and efficient processing of genome sequence data, e.g. for variant discovery, is key to the mainstream adoption of high-throughput technology for disease prevention and for clinical use. Achieving scalability, however, requires a significant effort to enable the parallel execution of the analysis tools that make up the pipelines. This is facilitated by the new Spark versions of the well-known GATK toolkit, which offer a black-box approach by transparently exploiting the underlying MapReduce architecture. In this paper, we report on our experience implementing a standard variant discovery pipeline using GATK 4.0 with Docker-based deployment over a cluster. We provide a preliminary performance analysis, comparing the processing times and cost to those of the new Microsoft Genomics Services.

Original language: English
Journal: CEUR Workshop Proceedings
Volume: 2161
Publication status: Published - 2018
Event: 26th Italian Symposium on Advanced Database Systems, SEBD 2018 - Castellaneta Marina (Taranto), Italy
Duration: 24 Jun 2018 → 27 Jun 2018

Bibliographical note

Publisher Copyright:
© 2018 CEUR-WS. All rights reserved.

Keywords

  • Cluster computing
  • Distributed processing
  • Genomics
  • Next Generation Sequencing
  • Spark
  • Variant analysis

ASJC Scopus subject areas

  • General Computer Science
