PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models

18/07/2021

Chaoyang He, Shen Li, Mahdi Soltanolkotabi, Salman Avestimehr

Keywords: Theory, Deep Learning, Recurrent Networks; Theory, Control Theory, Applications, Hardware and Systems

Abstract: The size of Transformer models is growing at an unprecedented rate. It has taken less than one year to reach trillion-level parameters since the release of GPT-3 (175B). Training such models requires both substantial engineering effort and enormous computing resources, which are luxuries most research teams cannot afford. In this paper, we propose PipeTransformer, which leverages automated elastic pipelining for efficient distributed training of Transformer models. In PipeTransformer, we design an adaptive on-the-fly freeze algorithm that can identify and gradually freeze some layers during training, and an elastic pipelining system that can dynamically allocate resources to train the remaining active layers. More specifically, PipeTransformer automatically excludes frozen layers from the pipeline, packs active layers into fewer GPUs, and forks more replicas to increase data-parallel width. We evaluate PipeTransformer using Vision Transformer (ViT) on ImageNet and BERT on the SQuAD and GLUE datasets. Our results show that, compared to the state-of-the-art baseline, PipeTransformer attains up to a 2.83-fold speedup without losing accuracy. We also provide various performance analyses for a more comprehensive understanding of our algorithmic and system-wise design. Finally, we have modularized our training system with flexible APIs and made the source code publicly available at https://DistML.ai.
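
The resource reallocation described in the abstract (drop frozen layers from the pipeline, pack active layers into fewer GPUs, fork more replicas) can be illustrated with a small sketch. This is not the PipeTransformer implementation: the function name `plan_resources`, the fixed `layers_per_gpu` capacity, and the ceiling-based packing rule are all assumptions made for illustration only.

```python
from math import ceil
from typing import Tuple


def plan_resources(total_gpus: int, num_layers: int,
                   num_frozen: int, layers_per_gpu: int) -> Tuple[int, int]:
    """Hypothetical planner, not the paper's algorithm.

    Returns (pipeline_length, data_parallel_replicas) under the assumption
    that every GPU can hold a fixed number of active layers.
    """
    if total_gpus < 1 or layers_per_gpu < 1:
        raise ValueError("total_gpus and layers_per_gpu must be positive")
    if not 0 <= num_frozen < num_layers:
        raise ValueError("num_frozen must be in [0, num_layers)")

    # Frozen layers are excluded from the pipeline entirely.
    active_layers = num_layers - num_frozen

    # Pack the remaining active layers into as few GPUs as the assumed
    # per-GPU capacity allows.
    pipeline_length = min(total_gpus, ceil(active_layers / layers_per_gpu))

    # GPUs freed by the shorter pipeline host additional pipeline replicas,
    # which widens data parallelism.
    replicas = total_gpus // pipeline_length
    return pipeline_length, replicas


if __name__ == "__main__":
    # Illustrative numbers only: 8 GPUs, 12 layers, 3 layers per GPU.
    for frozen in (0, 6, 9):
        length, width = plan_resources(8, 12, frozen, 3)
        print(f"frozen={frozen}: pipeline length={length}, replicas={width}")
```

In this toy setting, freezing more layers shortens each pipeline and raises the replica count, which is the qualitative behavior the abstract describes; how the real system decides when to freeze and how to repartition is not specified here.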

The talk and the respective paper were published at the ICML 2021 virtual conference.

Similar Papers

12/07/2020

Improving Transformer Optimization Through Better Initialization

Xiao Shi Huang, Felipe Perez, Jimmy Ba, Maksims Volkovs

Keywords: Sequential, Network, and Time-Series Modeling

14:52

06/12/2021

Searching for Efficient Transformers for Language Modeling

David So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc V Le

Keywords: transformers, language

13:29

06/12/2021

SimiGrad: Fine-Grained Adaptive Batching for Large Scale Training using Gradient Similarity Measurement

Heyang Qin, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He

Keywords: machine learning

11:23

06/12/2021

Augmented Shortcuts for Vision Transformers

Yehui Tang, Kai Han, Chang Xu, An Xiao, Yiping Deng, Chao Xu, Yunhe Wang

Keywords: transformers, vision

7:28

03/05/2021

Growing Efficient Deep Networks by Structured Continuous Sparsification

Xin Yuan, Pedro Savarese, Michael Maire

Keywords: network pruning, computer vision, deep learning, neural architecture search

16:52

18/07/2021

Dataset Condensation with Differentiable Siamese Augmentation

Bo Zhao, Hakan Bilen

Keywords: Algorithms, Multitask, Transfer, and Meta Learning

5:02

26/04/2020

DiffTaichi: Differentiable Programming for Physical Simulation

Yuanming Hu, Luke Anderson, Tzu-Mao Li, Qi Sun, Nathan Carr, Jonathan Ragan-Kelley, Fredo Durand

Keywords: Differentiable programming, robotics, optimal control, physical simulation, machine learning system

5:01

12/07/2020

AdaScale SGD: A User-Friendly Algorithm for Distributed Training

Tyler Johnson, Pulkit Agrawal, Haijie Gu, Carlos Guestrin

Keywords: Optimization - Large Scale, Parallel and Distributed

14:22

06/12/2021

BulletTrain: Accelerating Robust Neural Network Training via Boundary Example Mining

Weizhe Hua, Yichi Zhang, Chuan Guo, Zhiru Zhang, G. Edward Suh

Keywords: deep learning, machine learning, robustness, adversarial robustness and security

6:36

03/05/2021

Large Batch Simulation for Deep Reinforcement Learning

Brennan Shacklett, Erik Wijmans, Aleksei Petrenko, Manolis Savva, Dhruv Batra, Vladlen Koltun, Kayvon Fatahalian

Keywords: reinforcement learning, simulation

5:29

11/08/2020

A computational approach to packet classification

Alon Rashelbach, Ori Rottenstreich, Mark Silberstein

Keywords: Neural Networks, Virtual Switches, Packet Classification

16:56

26/04/2020

Reformer: The Efficient Transformer

Nikita Kitaev, Lukasz Kaiser, Anselm Levskaya

Keywords: attention, locality sensitive hashing, reversible layers

14:23

14/06/2020

Deep Non-Line-of-Sight Reconstruction

Javier Grau Chopite, Matthias B. Hullin, Michael Wand, Julian Iseringhausen

Keywords: non-line-of-sight, time-of-flight, transient imaging, deep learning, geometry reconstruction, synthetic training

1:00

26/04/2020

Reducing Transformer Depth on Demand with Structured Dropout

Angela Fan, Edouard Grave, Armand Joulin

Keywords: reduction, regularization, pruning, dropout, transformer

5:01

02/02/2021

DPFPS: Dynamic and Progressive Filter Pruning for Compressing Convolutional Neural Networks from Scratch

Xiaofeng Ruan, Yufan Liu, Bing Li, Chunfeng Yuan, Weiming Hu

14:38

03/05/2021

HyperGrid Transformers: Towards A Single Model for Multiple Tasks

Yi Tay, Zhe Zhao, Dara Bahri, Donald Metzler, Da-Cheng Juan

Keywords: Transformers, Multi-Task Learning

5:14

25/07/2020

Distributed equivalent substitution training for large-scale recommender systems

Haidong Rong, Yangzihao Wang, Feihu Zhou, Junjie Zhai, Haiyang Wu, Rui Lan, Fan Li, Han Zhang, Yuekui Yang, Zhenyu Guo, Di Wang

Keywords: ranking systems, synchronous training, recommender systems, dynamic sparse features

17:34

18/07/2021

Optimization Planning for 3D ConvNets

Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Tao Mei

Keywords: Applications, Activity and Event Recognition

5:13

06/12/2021

BatchQuant: Quantized-for-all Architecture Search with Robust Quantizer

Haoping Bai, Meng Cao, Ping Huang, Jiulong Shan

Keywords: deep learning, optimization

4:12

03/05/2021

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen

5:07

14/06/2020

APQ: Joint Search for Network Architecture, Pruning and Quantization Policy

Tianzhe Wang, Kuan Wang, Han Cai, Ji Lin, Zhijian Liu, Hanrui Wang, Yujun Lin, Song Han

Keywords: efficiency, model compression, joint design, neural architecture search, channel pruning, mixed-precision quantization

1:00

26/04/2020

Learned step size quantization

Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, Dharmendra S. Modha

Keywords: deep learning, low precision, classification, quantization

4:40

19/04/2021

Multi-split reversible transformers can enhance neural machine translation

Yuekai Zhao, Shuchang Zhou, Zhihua Zhang

12:00

06/12/2020

ShiftAddNet: A Hardware-Inspired Deep Network

Haoran You, Xiaohan Chen, Yongan Zhang, Chaojian Li, Sicheng Li, Zihao Liu, Zhangyang Wang, Yingyan Lin

3:25

05/04/2021

PipeMare: Asynchronous Pipeline Parallel DNN Training

Bowen Yang, Jian Zhang, Jonathan Li, Christopher Re, Christopher Aberger, Christopher De Sa

16:57

18/07/2021

On the Proof of Global Convergence of Gradient Descent for Deep ReLU Networks with Linear Widths

Quynh Nguyen

Keywords: Theory, Deep learning Theory

4:43

06/12/2021

Piper: Multidimensional Planner for DNN Parallelization

Jakub Tarnawski, Deepak Narayanan, Amar Phanishayee

Keywords: optimization, transformers, graph learning

11:47

06/12/2020

Deep Transformers with Latent Depth

Xian Li, Asa Cooper Stickland, Yuqing Tang, Xiang Kong

3:17

02/02/2021

*-CFQ: Analyzing the Scalability of Machine Learning on a Compositional Task

Dmitry Tsarkov, Tibor Tihon, Nathan Scales, Nikola Momchev, Danila Sinopalnikov, Nathanael Schärli

16:33

06/12/2020

Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

Minjia Zhang, Yuxiong He

3:18

06/12/2021

IA-RED$^2$: Interpretability-Aware Redundancy Reduction for Vision Transformers

Bowen Pan, Rameswar Panda, Yifan Jiang, Zhangyang Wang, Rogerio Feris, Aude Oliva

Keywords: deep learning, transformers, vision, interpretability

11:25

26/04/2020

Once for All: Train One Network and Specialize it for Efficient Deployment

Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, Song Han

Keywords: Efficient Deep Learning, Specialized Neural Network Architecture, AutoML

4:53

15/11/2020

A Modular Cost Analysis for Probabilistic Programs

Martin Avanzini, Georg Moser, Michael Schaper

Keywords: probabilistic programs, automation, average complexity, modularity

14:58

13/04/2021

Approximate data deletion from machine learning models

Zachary Izzo, Mary Anne Smart, Kamalika Chaudhuri, James Zou

3:18

03/05/2021

BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction

Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, Shi Gu

Keywords: Second-order analysis, Mixed Precision, Post Training Quantization

4:36

06/12/2021

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, Steven Chu Hong Hoi

Keywords: transformers, vision, representation learning

9:40

06/12/2020

LAPAR: Linearly-Assembled Pixel-Adaptive Regression Network for Single Image Super-resolution and Beyond

Wenbo Li, Kun Zhou, Lu Qi, Nianjuan Jiang, Jiangbo Lu, Jiaya Jia

3:09

18/07/2021

Memory-Efficient Pipeline-Parallel DNN Training

Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, Matei Zaharia

Keywords: Applications, Hardware and Systems

5:37

18/07/2021

A Free Lunch From ANN: Towards Efficient, Accurate Spiking Neural Networks Calibration

Yuhang Li, Shikuang Deng, Xin Dong, Ruihao Gong, Shi Gu

Keywords: Deep Learning, Architectures

4:42