An Efficient Task-based All-Reduce for Machine Learning Applications

Zhenyu Li, James Davis, Stephen Jarvis

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

All-Reduce is a collective-combine operation frequently utilised for synchronous parameter updates in parallel machine learning algorithms. The performance of this operation - and subsequently of the algorithm itself - is heavily dependent on its implementation, its configuration, and the supporting hardware on which it is run. Given the pivotal role of all-reduce, a failure in any of these regards will significantly impact the resulting scientific output. In this research we explore the performance of alternative all-reduce algorithms in data-flow graphs and compare these to the commonly used reduce-broadcast approach. We present an architecture and interface for all-reduce in task-based frameworks, and a parallelization scheme for object serialization and computation. We present a concrete, novel application of a butterfly all-reduce algorithm on the Apache Spark framework on a high-performance compute cluster, and demonstrate that the new butterfly algorithm achieves a logarithmic speed-up with respect to the vector length compared with the original reduce-broadcast method: a 9x speed-up is observed for vector lengths of the order of 10⁸. This improvement comprises both algorithmic changes (65%) and parallel-processing optimization (35%). The effectiveness of the new butterfly all-reduce is demonstrated using real-world neural network applications with the Spark framework. For the model-update operation we observe significant speed-ups using the new butterfly algorithm compared with the original reduce-broadcast, for both smaller (Cifar and Mnist) and larger (ImageNet) datasets.
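The paper's Spark implementation is not reproduced in this record. As an illustration of the butterfly (recursive-doubling) exchange pattern the abstract refers to, below is a minimal single-process sketch: in each of log2(p) rounds, worker i combines its partial vector with that of partner i XOR 2^r, so every worker holds the full reduction without a final broadcast. The function name, the NumPy arrays, and the power-of-two worker count are illustrative assumptions, not the authors' code.

```python
import numpy as np

def butterfly_allreduce(vectors):
    """Simulate a butterfly all-reduce over p = 2^k workers.

    In round r, worker i exchanges its partial result with partner
    i XOR 2^r and both sum the two vectors; after log2(p) rounds
    every worker holds the reduction over all workers.
    """
    p = len(vectors)
    assert p > 0 and p & (p - 1) == 0, "butterfly exchange assumes a power-of-two worker count"
    partial = [v.copy() for v in vectors]
    r = 1
    while r < p:
        # Each pair (i, i ^ r) swaps partials and combines them.
        partial = [partial[i] + partial[i ^ r] for i in range(p)]
        r <<= 1
    return partial  # every worker now holds the same reduced vector

# Usage: 8 workers, each holding a local gradient vector.
grads = [np.full(4, i, dtype=np.float64) for i in range(8)]
result = butterfly_allreduce(grads)
assert all(np.array_equal(v, result[0]) for v in result)
print(result[0])  # sum over workers 0..7: [28. 28. 28. 28.]
```

Because every worker finishes with the combined result, this pattern avoids the bottleneck of gathering to a single reducer and re-broadcasting, which is the source of the logarithmic speed-up reported in the abstract.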

Original language: English
Title of host publication: Proceedings of MLHPC 2017
Subtitle of host publication: Machine Learning in HPC Environments - Held in conjunction with SC 2017: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450351379
DOIs
Publication status: Published - 12 Nov 2017
Event: 2017 Machine Learning in HPC Environments, MLHPC 2017 - Denver, United States
Duration: 12 Nov 2017 – 17 Nov 2017

Publication series

Name: Proceedings of MLHPC 2017: Machine Learning in HPC Environments - Held in conjunction with SC 2017: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 2017 Machine Learning in HPC Environments, MLHPC 2017
Country/Territory: United States
City: Denver
Period: 12/11/17 – 17/11/17

Bibliographical note

Funding Information:
This research is supported by Atos IT Services UK Ltd and by the EPSRC Centre for Doctoral Training in Urban Science and Progress (grant no. EP/L016400/1).

Publisher Copyright:
© 2017 Association for Computing Machinery.

Keywords

  • Apache Spark
  • Butterfly All-Reduce
  • Data-flow Frameworks
  • Synchronous Model Training

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Artificial Intelligence
