MapRDD: Finer grained resilient distributed dataset for machine learning

Zhenyu Li; Stephen Jarvis

doi:10.1145/3206333.3206335

Engineering and Physical Sciences

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

The Resilient Distributed Dataset (RDD) is the core memory abstraction behind the popular data-analytic framework Apache Spark. We present an extension to the Resilient Distributed Dataset for map transformations, that we call MapRDD, which takes advantage of the underlying relations between records in the parent and child datasets, in order to achieve random-access of individual records in a partition. The design is complemented by a new MemoryStore, which manages data sampling and data transfers asynchronously. We use the ImageNet dataset to demonstrate that: (I) The initial data loading phase is redundant and can be completely avoided; (II) Sampling on the CPU can be entirely overlapped with training on the GPU to achieve near full occupancy; (III) CPU processing cycles and memory usage can be reduced by more than 90%, allowing other applications to be run simultaneously; (IV) Constant training step time can be achieved, regardless of the size of the partition, for up to 1.3 million records in our experiments. We expect to obtain the same improvements in other RDD transformations via further research on finer-grained implicit & explicit dataset relations.
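
The random-access idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the Spark API: `LazyMapPartition`, its `sample` method, and the squaring function are hypothetical names chosen for illustration only. The sketch assumes that a map transformation is one-to-one, so a child record can be computed from its parent record on demand, without loading or transforming the rest of the partition.

```python
import random
from typing import Callable, Generic, Sequence, TypeVar

T = TypeVar("T")
U = TypeVar("U")


class LazyMapPartition(Generic[T, U]):
    """Hypothetical stand-in for a map-transformed partition.

    A map transformation is one-to-one: child record i depends only on
    parent record i. That relation lets a single child record be computed
    on demand instead of materialising the whole partition first.
    """

    def __init__(self, parent: Sequence[T], fn: Callable[[T], U]) -> None:
        self.parent = parent
        self.fn = fn
        self.calls = 0  # counts how many parent records were transformed

    def __len__(self) -> int:
        return len(self.parent)

    def __getitem__(self, index: int) -> U:
        # Random access: apply the map function to one parent record only.
        self.calls += 1
        return self.fn(self.parent[index])

    def sample(self, k: int, rng: random.Random) -> list:
        # Draw k distinct record indices, then transform just those records.
        indices = rng.sample(range(len(self)), k)
        return [self[i] for i in indices]


# Assumed toy data: one million integer "records" and a squaring map function.
partition = LazyMapPartition(range(1_000_000), lambda x: x * x)
batch = partition.sample(32, random.Random(0))
print(len(batch), partition.calls)  # prints "32 32"
```

In this sketch, drawing a batch of 32 records transforms exactly 32 parent records, so the cost of a sampling step does not depend on the partition size.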

Original language	English
Title of host publication	Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018
Publisher	Association for Computing Machinery
ISBN (Electronic)	9781450357036
DOIs	https://doi.org/10.1145/3206333.3206335
Publication status	Published - 15 Jun 2018
Event	5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018 - Houston, United States Duration: 15 Jun 2018 → …

Publication series

Name	Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018

Conference

Conference	5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018
Country/Territory	United States
City	Houston
Period	15/06/18 → …

Bibliographical note

Publisher Copyright:
© 2018 ACM.

Keywords

Apache Spark
Graphical processing units
Heterogeneous architecture
Random sampling
Resilient distributed dataset

ASJC Scopus subject areas

Hardware and Architecture

Access to Document

10.1145/3206333.3206335

Cite this

Li, Z., & Jarvis, S. (2018). MapRDD: Finer grained resilient distributed dataset for machine learning. In Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018, Article 3206335 (Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018). Association for Computing Machinery. https://doi.org/10.1145/3206333.3206335

Li, Zhenyu; Jarvis, Stephen. / MapRDD: Finer grained resilient distributed dataset for machine learning. Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018. Association for Computing Machinery, 2018. (Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018).

@inproceedings{09f8e3300d7b4e8a9edd7688308d51dd,

title = "MapRDD: Finer grained resilient distributed dataset for machine learning",

abstract = "The Resilient Distributed Dataset (RDD) is the core memory abstraction behind the popular data-analytic framework Apache Spark. We present an extension to the Resilient Distributed Dataset for map transformations, that we call MapRDD, which takes advantage of the underlying relations between records in the parent and child datasets, in order to achieve random-access of individual records in a partition. The design is complemented by a new MemoryStore, which manages data sampling and data transfers asynchronously. We use the ImageNet dataset to demonstrate that: (I) The initial data loading phase is redundant and can be completely avoided; (II) Sampling on the CPU can be entirely overlapped with training on the GPU to achieve near full occupancy; (III) CPU processing cycles and memory usage can be reduced by more than 90%, allowing other applications to be run simultaneously; (IV) Constant training step time can be achieved, regardless of the size of the partition, for up to 1.3 million records in our experiments. We expect to obtain the same improvements in other RDD transformations via further research on finer-grained implicit & explicit dataset relations.",

keywords = "Apache Spark, Graphical processing units, Heterogeneous architecture, Random sampling, Resilient distributed dataset",

author = "Zhenyu Li and Stephen Jarvis",

note = "Publisher Copyright: {\textcopyright} 2018 ACM.; 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018 ; Conference date: 15-06-2018",

year = "2018",

month = jun,

day = "15",

doi = "10.1145/3206333.3206335",

language = "English",

series = "Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018",

publisher = "Association for Computing Machinery",

booktitle = "Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018",

}

Li, Z & Jarvis, S 2018, MapRDD: Finer grained resilient distributed dataset for machine learning. in Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018, 3206335, Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018, Association for Computing Machinery, 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018, Houston, United States, 15/06/18. https://doi.org/10.1145/3206333.3206335

MapRDD: Finer grained resilient distributed dataset for machine learning. / Li, Zhenyu; Jarvis, Stephen.
Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018. Association for Computing Machinery, 2018. 3206335 (Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018).

TY - GEN

T1 - MapRDD

T2 - 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018

AU - Li, Zhenyu

AU - Jarvis, Stephen

PY - 2018/6/15

Y1 - 2018/6/15

N2 - The Resilient Distributed Dataset (RDD) is the core memory abstraction behind the popular data-analytic framework Apache Spark. We present an extension to the Resilient Distributed Dataset for map transformations, that we call MapRDD, which takes advantage of the underlying relations between records in the parent and child datasets, in order to achieve random-access of individual records in a partition. The design is complemented by a new MemoryStore, which manages data sampling and data transfers asynchronously. We use the ImageNet dataset to demonstrate that: (I) The initial data loading phase is redundant and can be completely avoided; (II) Sampling on the CPU can be entirely overlapped with training on the GPU to achieve near full occupancy; (III) CPU processing cycles and memory usage can be reduced by more than 90%, allowing other applications to be run simultaneously; (IV) Constant training step time can be achieved, regardless of the size of the partition, for up to 1.3 million records in our experiments. We expect to obtain the same improvements in other RDD transformations via further research on finer-grained implicit & explicit dataset relations.

KW - Apache Spark

KW - Graphical processing units

KW - Heterogeneous architecture

KW - Random sampling

KW - Resilient distributed dataset

UR - http://www.scopus.com/inward/record.url?scp=85063043278&partnerID=8YFLogxK

U2 - 10.1145/3206333.3206335

DO - 10.1145/3206333.3206335

M3 - Conference contribution

AN - SCOPUS:85063043278

T3 - Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018

BT - Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018

PB - Association for Computing Machinery

Y2 - 15 June 2018

ER -

Li Z, Jarvis S. MapRDD: Finer grained resilient distributed dataset for machine learning. In Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018. Association for Computing Machinery. 2018. 3206335. (Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018). doi: 10.1145/3206333.3206335
