Optimizing Machine Learning on Apache Spark in HPC Environments

Zhenyu Li, James Davis, Stephen A. Jarvis

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Machine learning has established itself as a powerful tool for the construction of decision-making models and algorithms through the use of statistical techniques on training data. However, a significant impediment to its progress is the time spent training and improving the accuracy of these models: this is a data- and compute-intensive process, which can often take days, weeks or even months to complete. A common approach to accelerating this process is to employ multiple machines simultaneously, a trait shared with the field of High Performance Computing (HPC) and its clusters. However, existing distributed frameworks for data analytics and machine learning are designed for commodity servers, which do not realize the full potential of an HPC cluster, and thus deny the effective use of a readily available and potentially useful resource. In this work we adapt Apache Spark, a distributed data-flow framework, to support machine learning in HPC environments. There are inherent challenges to using Spark in this context: memory management, communication costs and synchronization overheads all limit its efficiency. To this end we introduce: (i) the application of MapRDD, a fine-grained distributed data representation; (ii) a task-based all-reduce implementation; and (iii) a new asynchronous Stochastic Gradient Descent (SGD) algorithm using non-blocking all-reduce. We demonstrate up to a 2.6x overall speedup (or an 11.2x theoretical speedup with an Nvidia K80 graphics card), an 82-91% compute ratio, and an 80% reduction in memory usage when training the GoogLeNet model to classify 10% of the ImageNet dataset on a 32-node cluster. We also demonstrate that the new asynchronous SGD achieves a convergence rate comparable to the synchronous method. With increasing use of accelerator cards, larger cluster computers and deeper neural network models, we predict that a further 2x speedup (i.e. a 22.4x accumulated speedup) is obtainable with the new asynchronous SGD algorithm on heterogeneous clusters.
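The asynchronous SGD described in the abstract overlaps gradient aggregation with the computation of the next mini-batch. As an illustration only, the minimal Scala sketch below mimics that overlap on a single machine, using a Future to stand in for a non-blocking all-reduce; the names allReduceAsync and localGradient, the linear model, the learning rate and the toy data are assumptions made for this sketch and are not taken from the paper or from Spark's API.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

object AsyncSgdSketch {
  type Vec = Array[Double]

  // Stand-in for a non-blocking all-reduce: an element-wise sum wrapped in a
  // Future. On a real cluster this would be a collective across all workers.
  def allReduceAsync(grads: Seq[Vec]): Future[Vec] = Future {
    grads.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  }

  // Mean squared-error gradient for a linear model w . x, used purely as a
  // placeholder for a worker's local mini-batch gradient.
  def localGradient(w: Vec, batch: Seq[(Vec, Double)]): Vec = {
    val g = Array.fill(w.length)(0.0)
    for ((x, y) <- batch) {
      val err = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y
      for (i <- w.indices) g(i) += 2.0 * err * x(i) / batch.size
    }
    g
  }

  def main(args: Array[String]): Unit = {
    val lr = 0.05
    var w: Vec = Array(0.0, 0.0)
    val batch = Seq((Array(1.0, 2.0), 5.0), (Array(2.0, 1.0), 4.0))

    // The aggregation launched on the previous step, possibly still in flight.
    var pending: Option[Future[Vec]] = None
    for (_ <- 1 to 200) {
      // Compute this step's local gradient while the previous aggregation proceeds.
      val grad = localGradient(w, batch)
      // Apply the previous step's aggregated gradient once it is available.
      pending.foreach { f =>
        val agg = Await.result(f, Duration.Inf)
        w = w.zip(agg).map { case (wi, gi) => wi - lr * gi }
      }
      // Launch the (non-blocking) aggregation of this step's gradient.
      pending = Some(allReduceAsync(Seq(grad)))
    }
    pending.foreach { f =>
      val agg = Await.result(f, Duration.Inf)
      w = w.zip(agg).map { case (wi, gi) => wi - lr * gi }
    }
    println(s"learned weights: ${w.mkString(", ")}")
  }
}
```

Because each local gradient here is computed before the previous aggregate has been applied, every update is one step stale; the comparable convergence rate reported in the abstract suggests that this kind of staleness can be tolerated in practice.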

Original language: English
Title of host publication: Proceedings of MLHPC 2018
Subtitle of host publication: Machine Learning in HPC Environments, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 95-105
Number of pages: 11
ISBN (Electronic): 9781728101804
DOIs
Publication status: Published - 8 Feb 2019
Event: 2018 IEEE/ACM Machine Learning in HPC Environments, MLHPC 2018 - Dallas, United States
Duration: 12 Nov 2018 → …

Publication series

Name: Proceedings of MLHPC 2018: Machine Learning in HPC Environments, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 2018 IEEE/ACM Machine Learning in HPC Environments, MLHPC 2018
Country/Territory: United States
City: Dallas
Period: 12/11/18 → …

Bibliographical note

Funding Information:
This research is supported by Atos IT Services UK Ltd and by the EPSRC Centre for Doctoral Training in Urban Science and Progress (grant no. EP/L016400/1).

Publisher Copyright:
© 2018 IEEE.

Keywords

  • All-Reduce
  • Apache Spark
  • Asynchronous Stochastic Gradient Descent
  • High Performance Computing
  • Machine Learning

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Networks and Communications
