MapRDD: Finer grained resilient distributed dataset for machine learning

Zhenyu Li, Stephen Jarvis

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)

Abstract

The Resilient Distributed Dataset (RDD) is the core memory abstraction behind the popular data-analytic framework Apache Spark. We present an extension to the Resilient Distributed Dataset for map transformations, which we call MapRDD. It takes advantage of the underlying relations between records in the parent and child datasets to achieve random access to individual records in a partition. The design is complemented by a new MemoryStore, which manages data sampling and data transfers asynchronously. We use the ImageNet dataset to demonstrate that: (I) the initial data loading phase is redundant and can be completely avoided; (II) sampling on the CPU can be entirely overlapped with training on the GPU to achieve near full occupancy; (III) CPU processing cycles and memory usage can be reduced by more than 90%, allowing other applications to be run simultaneously; (IV) constant training step time can be achieved, regardless of the size of the partition, for up to 1.3 million records in our experiments. We expect to obtain the same improvements in other RDD transformations via further research on finer-grained implicit and explicit dataset relations.
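The random-access idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation and does not use Spark: it is plain Python, and the names `LazyMappedPartition`, `decode`, and `sample` are hypothetical, chosen only for illustration. It assumes that a map transformation relates child record i to parent record i one-to-one, so a sampled record can be computed on demand rather than materialising the whole child partition first.

```python
import random
from typing import Callable, Generic, List, Sequence, TypeVar

T = TypeVar("T")
U = TypeVar("U")


class LazyMappedPartition(Generic[T, U]):
    """Hypothetical stand-in for a map-transformed partition.

    Child record i is defined as fn(parent[i]), so it can be computed on
    demand instead of materialising the whole child partition up front.
    """

    def __init__(self, parent: Sequence[T], fn: Callable[[T], U]) -> None:
        self.parent = parent
        self.fn = fn
        self.calls = 0  # number of times fn was actually applied

    def __len__(self) -> int:
        return len(self.parent)

    def __getitem__(self, index: int) -> U:
        # One-to-one parent/child relation: only this record is transformed.
        self.calls += 1
        return self.fn(self.parent[index])

    def sample(self, batch_size: int, rng: random.Random) -> List[U]:
        # Draw a mini-batch by random access, touching batch_size records only.
        indices = rng.sample(range(len(self)), batch_size)
        return [self[i] for i in indices]


def decode(record: int) -> int:
    # Placeholder for an expensive per-record step such as image decoding.
    return record * 2


# Assumed toy data: a partition of one million integer "records".
partition = LazyMappedPartition(range(1_000_000), decode)
batch = partition.sample(32, random.Random(0))
```

In this sketch the cost of drawing a batch depends on the batch size and not on the partition size, which is the property the abstract attributes to MapRDD; the asynchronous MemoryStore and the GPU transfer are deliberately left out.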

Original language: English
Title of host publication: Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450357036
DOIs
Publication status: Published - 15 Jun 2018
Event: 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018 - Houston, United States
Duration: 15 Jun 2018 → …

Publication series

Name: Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018

Conference

Conference: 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018
Country/Territory: United States
City: Houston
Period: 15/06/18 → …

Bibliographical note

Publisher Copyright:
© 2018 ACM.

Keywords

  • Apache Spark
  • Graphical processing units
  • Heterogeneous architecture
  • Random sampling
  • Resilient distributed dataset

ASJC Scopus subject areas

  • Hardware and Architecture
