Abstract
Machine learning has established itself as a powerful tool for constructing decision-making models and algorithms by applying statistical techniques to training data. However, a significant impediment to its progress is the time spent training these models and improving their accuracy: this is a data- and compute-intensive process that can take days, weeks or even months to complete. A common approach to accelerating this process is to use multiple machines simultaneously, a trait shared with the field of High Performance Computing (HPC) and its clusters. However, existing distributed frameworks for data analytics and machine learning are designed for commodity servers; they do not realize the full potential of an HPC cluster and thus deny the effective use of a readily available and potentially useful resource. In this work we adapt Apache Spark, a distributed data-flow framework, to support machine learning in HPC environments. There are inherent challenges to using Spark in this context: memory management, communication costs and synchronization overheads all hinder its efficiency. To this end we introduce: (i) the application of MapRDD, a fine-grained distributed data representation; (ii) a task-based all-reduce implementation; and (iii) a new asynchronous Stochastic Gradient Descent (SGD) algorithm using non-blocking all-reduce. We demonstrate up to a 2.6x overall speedup (or an 11.2x theoretical speedup with an Nvidia K80 graphics card), an 82-91% compute ratio, and an 80% reduction in memory usage when training the GoogLeNet model to classify 10% of the ImageNet dataset on a 32-node cluster. We also demonstrate a comparable convergence rate using the new asynchronous SGD with respect to the synchronous method.
With increasing use of accelerator cards, larger cluster computers and deeper neural network models, we predict a 2x further speedup (i.e. 22.4x accumulated speedup) is obtainable with the new asynchronous SGD algorithm on heterogeneous clusters.
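For readers unfamiliar with the all-reduce pattern mentioned in the abstract, the following is a minimal single-process sketch of how per-worker gradients can be summed and averaged so that every worker ends up with the same result. It is an illustration only: it is not the paper's task-based Spark implementation, it is blocking rather than non-blocking, and the choice of a ring algorithm, the function names `ring_all_reduce` and `average_gradients`, and the example values are all assumptions made here.

```python
from typing import List


def ring_all_reduce(worker_grads: List[List[float]]) -> List[List[float]]:
    """Simulate a ring all-reduce (sum) over per-worker gradient vectors.

    Hypothetical illustration: workers are list entries, and "sending" is a
    plain in-memory copy. Returns one summed vector per worker.
    """
    n = len(worker_grads)
    dim = len(worker_grads[0])
    # Chunk c covers indices bounds[c]:bounds[c + 1]; chunks may be uneven.
    bounds = [(c * dim) // n for c in range(n + 1)]
    data = [list(g) for g in worker_grads]  # copy so inputs stay untouched

    # Phase 1 (reduce-scatter): at step s, worker w sends chunk (w - s) mod n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = []
        for w in range(n):
            c = (w - s) % n
            sends.append((c, data[w][bounds[c]:bounds[c + 1]]))
        for w, (c, payload) in enumerate(sends):
            dst = (w + 1) % n
            for i, value in enumerate(payload):
                data[dst][bounds[c] + i] += value

    # Phase 2 (all-gather): worker w now owns the full sum of chunk
    # (w + 1) mod n; completed chunks are passed around the ring and
    # overwrite the partial values held by the receiver.
    for s in range(n - 1):
        sends = []
        for w in range(n):
            c = (w + 1 - s) % n
            sends.append((c, data[w][bounds[c]:bounds[c + 1]]))
        for w, (c, payload) in enumerate(sends):
            dst = (w + 1) % n
            data[dst][bounds[c]:bounds[c + 1]] = payload

    return data


def average_gradients(worker_grads: List[List[float]]) -> List[List[float]]:
    """Sum gradients with the ring all-reduce, then divide by worker count."""
    n = len(worker_grads)
    return [[v / n for v in summed] for summed in ring_all_reduce(worker_grads)]


if __name__ == "__main__":
    grads = [
        [1.0, 2.0, 3.0, 4.0],
        [10.0, 20.0, 30.0, 40.0],
        [100.0, 200.0, 300.0, 400.0],
    ]
    for worker_id, avg in enumerate(average_gradients(grads)):
        print(worker_id, avg)
```

In a synchronous SGD step, each worker would apply the averaged gradient before starting the next batch; an asynchronous variant, as described in the abstract, would overlap this communication with further computation instead of waiting for it.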
Original language | English |
---|---|
Title of host publication | Proceedings of MLHPC 2018 |
Subtitle of host publication | Machine Learning in HPC Environments, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis |
Publisher | Institute of Electrical and Electronics Engineers (IEEE) |
Pages | 95-105 |
Number of pages | 11 |
ISBN (Electronic) | 9781728101804 |
Publication status | Published - 8 Feb 2019 |
Event | 2018 IEEE/ACM Machine Learning in HPC Environments, MLHPC 2018 - Dallas, United States. Duration: 12 Nov 2018 → … |
Publication series
Name | Proceedings of MLHPC 2018: Machine Learning in HPC Environments, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis |
---|
Conference
Conference | 2018 IEEE/ACM Machine Learning in HPC Environments, MLHPC 2018 |
---|---|
Country/Territory | United States |
City | Dallas |
Period | 12/11/18 → … |
Bibliographical note
Funding Information: This research is supported by Atos IT Services UK Ltd and by the EPSRC Centre for Doctoral Training in Urban Science and Progress (grant no. EP/L016400/1).
Publisher Copyright:
© 2018 IEEE.
Keywords
- All-Reduce
- Apache Spark
- Asynchronous Stochastic Gradient Descent
- High Performance Computing
- Machine Learning
ASJC Scopus subject areas
- Artificial Intelligence
- Computer Networks and Communications