Abstract
The Resilient Distributed Dataset (RDD) is the core memory abstraction behind the popular data analytics framework Apache Spark. We present an extension to the Resilient Distributed Dataset for map transformations, which we call MapRDD. It takes advantage of the underlying relations between records in the parent and child datasets to achieve random access to individual records in a partition. The design is complemented by a new MemoryStore, which manages data sampling and data transfers asynchronously. We use the ImageNet dataset to demonstrate that: (I) the initial data loading phase is redundant and can be completely avoided; (II) sampling on the CPU can be entirely overlapped with training on the GPU to achieve near-full occupancy; (III) CPU processing cycles and memory usage can be reduced by more than 90%, allowing other applications to run simultaneously; (IV) constant training step time can be achieved regardless of the size of the partition, for up to 1.3 million records in our experiments. We expect to obtain the same improvements in other RDD transformations via further research on finer-grained implicit and explicit dataset relations.
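The idea of exploiting the parent-child record relation in a map transformation can be illustrated with a minimal Python sketch. This is not the paper's implementation and does not use Spark's API: the class `LazyMappedPartition`, its `sample` method, and the `calls` counter are hypothetical names introduced here for illustration. The sketch assumes only that a map transformation is one-to-one, so child record i depends solely on parent record i, and it does not model the MemoryStore or asynchronous transfers to the GPU.

```python
import random
from typing import Callable, Generic, List, Sequence, TypeVar

T = TypeVar("T")
U = TypeVar("U")


class LazyMappedPartition(Generic[T, U]):
    """Hypothetical stand-in for a map-transformed partition.

    A map transformation is one-to-one, so child record i depends only on
    parent record i. A single child record can therefore be computed on
    demand, without materializing the whole child partition first.
    """

    def __init__(self, parent: Sequence[T], fn: Callable[[T], U]) -> None:
        self.parent = parent
        self.fn = fn
        self.calls = 0  # number of parent records transformed so far

    def __len__(self) -> int:
        return len(self.parent)

    def __getitem__(self, index: int) -> U:
        # Random access: transform only the requested parent record.
        self.calls += 1
        return self.fn(self.parent[index])

    def sample(self, batch_size: int, rng: random.Random) -> List[U]:
        # Draw indices with replacement and transform only those records.
        indices = [rng.randrange(len(self)) for _ in range(batch_size)]
        return [self[i] for i in indices]


# Usage: a batch of 32 records from a partition of one million records
# triggers 32 transformations, not one million.
partition = LazyMappedPartition(range(1_000_000), lambda x: x * 2)
batch = partition.sample(32, random.Random(0))
```

In this sketch the work per batch depends on the batch size rather than on the partition size, which is the property the abstract reports as constant training step time.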
Original language | English |
---|---|
Title of host publication | Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018 |
Publisher | Association for Computing Machinery |
ISBN (Electronic) | 9781450357036 |
Publication status | Published - 15 Jun 2018 |
Event | 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018 - Houston, United States; Duration: 15 Jun 2018 → … |
Publication series
Name | Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018 |
---|
Conference
Conference | 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2018 |
---|---|
Country/Territory | United States |
City | Houston |
Period | 15/06/18 → … |
Bibliographical note
Publisher Copyright: © 2018 ACM.
Keywords
- Apache Spark
- Graphical processing units
- Heterogeneous architecture
- Random sampling
- Resilient distributed dataset
ASJC Scopus subject areas
- Hardware and Architecture