05/04/2021

Scaling Distributed Training with Adaptive Summation

Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Olli Saarikivi, Tianju Xu, Vadim Eksarevskiy, Jaliya Ekanayake, Emad Barsoum

Abstract: Data parallelism is a common way to parallelize stochastic gradient descent (SGD). However, the loss of convergence at large minibatch sizes limits the scalability of data parallelism. This paper introduces a novel method for combining gradients, called Adasum, that significantly improves convergence when using large minibatches. The paper provides the intuition and formal justification for Adasum along with a convergence proof, and describes an efficient implementation of Adasum and its integration into the open-source toolkit Horovod for use in both TensorFlow and PyTorch. The paper empirically shows that Adasum improves convergence at large minibatch sizes for multiple optimizers (Momentum-SGD, Adam, and LAMB). For BERT-Large training with a minibatch size of 64K, training with both Adasum and LAMB converges in 20% fewer epochs than with LAMB alone. This combination also allows BERT-Large training to scale to a 128K minibatch size. While one of the motivations for LAMB was the inability of the Adam optimizer to scale beyond a minibatch size of 16K, we show that Adasum helps Adam scale BERT-Large training to a 64K minibatch size. Our implementation of Adasum in Horovod has already been adopted in several production environments.
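The abstract describes Adasum as a rule for combining workers' gradients rather than simply averaging them. As a minimal illustrative sketch (not quoted from the paper; the exact coefficients are my assumption and should be checked against the paper itself), a pairwise combination of this kind can be written so that orthogonal gradients are summed while parallel gradients are effectively averaged:

```python
import numpy as np

def adasum_pair(g1: np.ndarray, g2: np.ndarray, eps: float = 1e-30) -> np.ndarray:
    """Illustrative Adasum-style pairwise combination of two gradient vectors.

    Reduces to g1 + g2 when the gradients are orthogonal and to their
    average when they are parallel, interpolating in between.
    The exact coefficients are an assumption, not the paper's verified formula.
    """
    dot = np.dot(g1, g2)
    scale1 = 1.0 - dot / (2.0 * np.dot(g1, g1) + eps)  # eps guards against all-zero gradients
    scale2 = 1.0 - dot / (2.0 * np.dot(g2, g2) + eps)
    return scale1 * g1 + scale2 * g2
```

Since the abstract notes that Adasum is integrated into Horovod for both TensorFlow and PyTorch, the sketch below shows how it might be enabled from PyTorch. The model, optimizer, and hyperparameters are placeholders, and the flags should be verified against the Horovod documentation; to the best of my knowledge, `op=hvd.Adasum` is the way recent Horovod releases expose the Adasum reduction.

```python
import torch
import horovod.torch as hvd

hvd.init()

model = torch.nn.Linear(1024, 10)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Wrap the local optimizer so per-worker gradients are combined with the
# Adasum reduction instead of a plain averaging allreduce.
optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters(),
    op=hvd.Adasum,
)

# Start all workers from identical model and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```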

The video of this talk cannot be embedded. You can watch it here:
https://slideslive.com/38952712
The talk and the corresponding paper were published at the MLSYS 2021 virtual conference.
