SAFA: a semi-asynchronous protocol for fast federated learning with low overhead

Research output: Contribution to journal › Article › peer-review

Standard

SAFA: a semi-asynchronous protocol for fast federated learning with low overhead. / Wu, Wentai; He, Ligang; Lin, Weiwei; Mao, Rui; Maple, Carsten; Jarvis, Stephen.

In: IEEE Transactions on Computers, Vol. 70, No. 5, 9093123, 01.05.2021, p. 655-668.


Author

Wu, Wentai; He, Ligang; Lin, Weiwei; Mao, Rui; Maple, Carsten; Jarvis, Stephen. / SAFA: a semi-asynchronous protocol for fast federated learning with low overhead. In: IEEE Transactions on Computers. 2021; Vol. 70, No. 5. pp. 655-668.

Bibtex

@article{9d862fd0097d4b188e37fcd0af9b1e1d,
title = "SAFA: a semi-asynchronous protocol for fast federated learning with low overhead",
abstract = "Federated learning (FL) has attracted increasing attention as a promising approach to driving a vast number of end devices with artificial intelligence. However, it is very challenging to guarantee the efficiency of FL considering the unreliable nature of end devices while the cost of device-server communication cannot be neglected. In this article, we propose SAFA, a semi-asynchronous FL protocol, to address the problems in federated learning such as low round efficiency and poor convergence rate in extreme conditions (e.g., clients dropping offline frequently). We introduce novel designs in the steps of model distribution, client selection and global aggregation to mitigate the impacts of stragglers, crashes and model staleness in order to boost efficiency and improve the quality of the global model. We have conducted extensive experiments with typical machine learning tasks. The results demonstrate that the proposed protocol is effective in terms of shortening federated round duration, reducing local resource wastage, and improving the accuracy of the global model at an acceptable communication cost.",
keywords = "Distributed computing, edge intelligence, federated learning, machine learning",
author = "Wentai Wu and Ligang He and Weiwei Lin and Rui Mao and Carsten Maple and Stephen Jarvis",
note = "Funding Information: This work was supported in part by Worldwide Byte Security Information Technology Company Ltd., in part by National Natural Science Foundation of China under Grant 61772205, in part by Guangzhou Development Zone Science and Technology under Grant 2018GH17, in part by the Major Program of Guangdong Basic and Applied Research under Grant 2019B030302002, in part by Guangdong projects under Grant 2017B030314073 and Grant 2018B030325002, in part by the EPSRC Centre for Doctoral Training in Urban Science under EPSRC Grant EP/L016400/1, in part by the Alan Turing Institute under EPSRC Grant EP/N510129/1 and Grant PETRAS, and in part by the National Center of Excellence for IoT Systems Cybersecurity under Grant EP/S035362/1. Publisher Copyright: {\textcopyright} 1968-2012 IEEE.",
year = "2021",
month = may,
day = "1",
doi = "10.1109/TC.2020.2994391",
language = "English",
volume = "70",
pages = "655--668",
journal = "IEEE Transactions on Computers",
issn = "0018-9340",
publisher = "Institute of Electrical and Electronics Engineers (IEEE)",
number = "5",
}

RIS

TY - JOUR

T1 - SAFA

T2 - a semi-asynchronous protocol for fast federated learning with low overhead

AU - Wu, Wentai

AU - He, Ligang

AU - Lin, Weiwei

AU - Mao, Rui

AU - Maple, Carsten

AU - Jarvis, Stephen

N1 - Funding Information: This work was supported in part by Worldwide Byte Security Information Technology Company Ltd., in part by National Natural Science Foundation of China under Grant 61772205, in part by Guangzhou Development Zone Science and Technology under Grant 2018GH17, in part by the Major Program of Guangdong Basic and Applied Research under Grant 2019B030302002, in part by Guangdong projects under Grant 2017B030314073 and Grant 2018B030325002, in part by the EPSRC Centre for Doctoral Training in Urban Science under EPSRC Grant EP/L016400/1, in part by the Alan Turing Institute under EPSRC Grant EP/N510129/1 and Grant PETRAS, and in part by the National Center of Excellence for IoT Systems Cybersecurity under Grant EP/S035362/1. Publisher Copyright: © 1968-2012 IEEE.

PY - 2021/5/1

Y1 - 2021/5/1

N2 - Federated learning (FL) has attracted increasing attention as a promising approach to driving a vast number of end devices with artificial intelligence. However, it is very challenging to guarantee the efficiency of FL considering the unreliable nature of end devices while the cost of device-server communication cannot be neglected. In this article, we propose SAFA, a semi-asynchronous FL protocol, to address the problems in federated learning such as low round efficiency and poor convergence rate in extreme conditions (e.g., clients dropping offline frequently). We introduce novel designs in the steps of model distribution, client selection and global aggregation to mitigate the impacts of stragglers, crashes and model staleness in order to boost efficiency and improve the quality of the global model. We have conducted extensive experiments with typical machine learning tasks. The results demonstrate that the proposed protocol is effective in terms of shortening federated round duration, reducing local resource wastage, and improving the accuracy of the global model at an acceptable communication cost.

AB - Federated learning (FL) has attracted increasing attention as a promising approach to driving a vast number of end devices with artificial intelligence. However, it is very challenging to guarantee the efficiency of FL considering the unreliable nature of end devices while the cost of device-server communication cannot be neglected. In this article, we propose SAFA, a semi-asynchronous FL protocol, to address the problems in federated learning such as low round efficiency and poor convergence rate in extreme conditions (e.g., clients dropping offline frequently). We introduce novel designs in the steps of model distribution, client selection and global aggregation to mitigate the impacts of stragglers, crashes and model staleness in order to boost efficiency and improve the quality of the global model. We have conducted extensive experiments with typical machine learning tasks. The results demonstrate that the proposed protocol is effective in terms of shortening federated round duration, reducing local resource wastage, and improving the accuracy of the global model at an acceptable communication cost.

KW - Distributed computing

KW - edge intelligence

KW - federated learning

KW - machine learning

UR - http://www.scopus.com/inward/record.url?scp=85104096045&partnerID=8YFLogxK

U2 - 10.1109/TC.2020.2994391

DO - 10.1109/TC.2020.2994391

M3 - Article

AN - SCOPUS:85104096045

VL - 70

SP - 655

EP - 668

JO - IEEE Transactions on Computers

JF - IEEE Transactions on Computers

SN - 0018-9340

IS - 5

M1 - 9093123

ER -