TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models

18/07/2021


Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, Ion Stoica

Keywords: Algorithms, Large Scale Learning


Abstract: Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new dimension that is orthogonal to existing model-parallel approaches: thanks to the autoregressive property of Transformer-based language models, it is possible to perform pipeline parallelism within a single training sequence. This enables a more fine-grained pipeline than in previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic programming-based algorithm to calculate the optimal pipelining execution scheme for a specific model and cluster configuration. We show that TeraPipe can speed up training by 5.0x for the largest GPT-3 model, with 175 billion parameters, on an AWS cluster of 48 p3.16xlarge instances, compared with state-of-the-art model-parallel methods. The code for reproduction can be found at https://github.com/zhuohan123/terapipe
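The autoregressive property the abstract relies on can be illustrated with a toy example. The sketch below is not TeraPipe's implementation; it is a minimal single-head causal attention in plain Python, and the function names, the slice lengths `[3, 2, 3]`, and the random inputs are all assumptions made for illustration. It shows that processing a sequence slice by slice, with each slice seeing only the keys and values that have arrived so far, gives the same outputs as processing the whole sequence at once. That equivalence is what would let a pipeline stage hand the first slice to the next stage while it is still working on later slices.

```python
import math
import random


def causal_attention(queries, keys, values, offset=0):
    """Single-head causal attention.

    queries[i] sits at absolute position offset + i and attends only to
    keys[0 .. offset + i], never to later positions.
    """
    dim = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        last = offset + i
        scale = math.sqrt(len(q))
        scores = [sum(a * b for a, b in zip(q, keys[j])) / scale
                  for j in range(last + 1)]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[j][d] for j, w in enumerate(weights)) / total
            for d in range(dim)
        ])
    return outputs


def sliced_causal_attention(queries, keys, values, slice_lengths):
    """Process the sequence one token slice at a time (hypothetical helper)."""
    assert sum(slice_lengths) == len(queries)
    outputs, start = [], 0
    for length in slice_lengths:
        end = start + length
        # Only keys/values up to `end` are passed in, as if later tokens
        # had not reached this stage yet.
        outputs.extend(causal_attention(queries[start:end], keys[:end],
                                        values[:end], offset=start))
        start = end
    return outputs


random.seed(0)
seq_len, dim = 8, 4


def rand_matrix():
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(seq_len)]


q, k, v = rand_matrix(), rand_matrix(), rand_matrix()
full = causal_attention(q, k, v)
sliced = sliced_causal_attention(q, k, v, [3, 2, 3])
max_diff = max(abs(a - b) for row_a, row_b in zip(full, sliced)
               for a, b in zip(row_a, row_b))
print("largest difference between full and sliced outputs:", max_diff)
```

The same argument would not hold for a bidirectional model, where every position attends to later tokens and a slice could not be finished before the rest of the sequence is available.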


The talk and the corresponding paper were published at the ICML 2021 virtual conference.


Similar Papers

11/08/2020

A computational approach to packet classification

Alon Rashelbach, Ori Rottenstreich, Mark Silberstein

Keywords: Neural Networks, Virtual Switches, Packet Classification

16:56

06/12/2021

RLlib Flow: Distributed Reinforcement Learning is a Dataflow Problem

Eric Liang, Zhanghao Wu, Michael Luo, Sven Mika, Joseph Gonzalez, Ion Stoica

Keywords: reinforcement learning and planning

12:15

19/08/2021

Improving Stylized Neural Machine Translation with Iterative Dual Knowledge Transfer

Xuanxuan Wu, Jian Liu, Xinjie Li, Jinan Xu, Yufeng Chen, Yujie Zhang, Hui Huang

Keywords: Natural Language Processing, Machine Translation, Natural Language Generation

12:35

15/06/2020

HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism

Jay H. Park, Gyeongchan Yun, Chang M. Yi, Nguyen T. Nguyen, Seungmin Lee, Jaesik Choi, Sam H. Noh, Young-ri Choi

20:19

03/05/2021

Large Batch Simulation for Deep Reinforcement Learning

Brennan Shacklett, Erik Wijmans, Aleksei Petrenko, Manolis Savva, Dhruv Batra, Vladlen Koltun, Kayvon Fatahalian

Keywords: reinforcement learning, simulation

5:29

16/11/2020

AIN: Fast and Accurate Sequence Labeling with Approximate Inference Network

Xinyu Wang, Yong Jiang, Nguyen Bach, Tao Wang, Zhongqiang Huang, Fei Huang, Kewei Tu

Keywords: parallelization, faster prediction, linear-chain model, neural approaches

7:05

06/12/2021

Preserved central model for faster bidirectional compression in distributed settings

Constantin Philippenko, Aymeric Dieuleveut

Keywords: optimization, federated learning

13:29

12/07/2020

TaskNorm: Rethinking Batch Normalization for Meta-Learning

John Bronskill, Jonathan Gordon, James Requeima, Sebastian Nowozin, Richard Turner

Keywords: Transfer, Multitask and Meta-learning

13:56

06/12/2020

Accelerating Reinforcement Learning through GPU Atari Emulation

Steven Dalton, Iuri Frosio

3:12

18/07/2021

Parallelizing Legendre Memory Unit Training

Narsimha Reddy Chilkuri, Chris Eliasmith

Keywords: Deep Learning, Architectures

5:13

02/02/2021

Deterministic Mini-batch Sequencing for Training Deep Neural Networks

Subhankar Banerjee, Shayok Chakraborty


16:00

06/12/2021

Neural Routing by Memory

Kaipeng Zhang, Zhenqiang Li, Zhifeng Li, Wei Liu, Yoichi Sato

Keywords: deep learning

6:41

06/12/2021

Integrating Tree Path in Transformer for Code Representation

Han Peng, Ge Li, Wenhan Wang, YunFei Zhao, Zhi Jin

Keywords: machine learning, transformers

4:42

23/06/2021

AKG: Automatic Kernel Generation for Neural Processing Units using Polyhedral Transformations

Jie Zhao, Bojie Li, Wang Nie, Zhen Geng, Renwei Zhang, Xiong Gao, Bin Cheng, Chen Wu, Yun Cheng, Zheng Li, Peng Di, Kun Zhang, Xuefeng Jin

Keywords: neural networks, neural processing units, polyhedral model, code generation, auto-tuning

21:49

05/04/2021

Pipelined Backpropagation at Scale: Training Large Models without Batches

Atli Kosson, Vitaliy Chiley, Abhi Venigalla, Joel Hestness, Urs Koster

18:00

05/04/2021

Pipelined Backpropagation at Scale: Training Large Models without Batches

Atli Kosson, Vitaliy Chiley, Abhi Venigalla, Joel Hestness, Urs Koster

4:14

01/07/2020

Re-translation versus Streaming for Simultaneous Translation

Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, George Foster


23:21

23/08/2020

AutoML pipeline selection: Efficiently navigating the combinatorial space

Chengrun Yang, Jicong Fan, Ziyang Wu, Madeleine Udell

Keywords: pipeline search, greedy algorithms, experiment design, AutoML, tensor decomposition, submodular optimization, meta-learning

13:40

18/07/2021

PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models

Chaoyang He, Shen Li, Mahdi Soltanolkotabi, Salman Avestimehr

Keywords: Theory, Deep Learning, Recurrent Networks; Theory, Control Theory, Applications, Hardware and Systems

5:13

06/12/2020

MomentumRNN: Integrating Momentum into Recurrent Neural Networks

Tan Nguyen, Richard Baraniuk, Andrea Bertozzi, Stanley Osher, Bao Wang

3:09

04/07/2020

Dynamic Memory Induction Networks for Few-Shot Text Classification

Ruiying Geng, Binhua Li, Yongbin Li, Jian Sun, Xiaodan Zhu

Keywords: Few-Shot Classification, Dynamic Networks, Dynamic DMIN

6:13

03/08/2020

Extending the wait-free hierarchy to multi-threaded systems

Matthieu Perrin, Achour Mostéfaoui, Grégoire Bonin

Keywords: consensus number, linearizability, universality, memory allocation, multi-threaded system, wait-freedom, arrival models

20:23

08/12/2020

E.T.: Entity-Transformers. Coreference augmented Neural Language Model for richer mention representations via Entity-Transformer blocks

Nikolaos Stylianou, Ioannis Vlahavas


8:49

03/05/2021

Understanding the effects of data parallelism and sparsity on neural network training

Namhoon Lee, Thalaiyasingam Ajanthan, Philip Torr, Martin Jaggi

Keywords: sparsity, neural network training, data parallelism

4:52

05/04/2021

PipeMare: Asynchronous Pipeline Parallel DNN Training

Bowen Yang, Jian Zhang, Jonathan Li, Christopher Re, Christopher Aberger, Christopher De Sa

16:57

26/04/2020

Learning-Augmented Data Stream Algorithms

Tanqiu Jiang, Yi Li, Honghao Lin, Yisong Ruan, David P. Woodruff

Keywords: streaming algorithms, heavy hitters, F_p moment, distinct elements, cascaded norms

3:55

18/07/2021

Improving Generalization in Meta-learning via Task Augmentation

Huaxiu Yao, Long-Kai Huang, Linjun Zhang, Ying Wei, Li Tian, James Zou, Junzhou Huang, Zhenhui (Jessie) Li

Keywords: Algorithms, Multitask, Transfer, and Meta Learning

8:27

14/09/2020

Learning a Sequence of Sentiment Classification Tasks

Zixuan Ke, Bing Liu, Hao Wang, Lei Shu


14:23

02/02/2021

LRSC: Learning Representations for Subspace Clustering

Changsheng Li, Chen Yang, Bo Liu, Ye Yuan, Guoren Wang

15:09

14/06/2020

Meshed-Memory Transformer for Image Captioning

Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, Rita Cucchiara

Keywords: transformer, image captioning, vision and language, fully-attentive models, mesh connectivity, memory vectors, self-attention

1:00

03/05/2021

Few-Shot Bayesian Optimization with Deep Kernel Surrogates

Martin Wistuba, Josif Grabocka

Keywords: automl, bayesian optimization, metalearning, few-shot learning

5:18

04/07/2020

Extractive Summarization as Text Matching

Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, Xuanjing Huang

Keywords: Extractive Summarization, Text Matching, extractive task, semantic problem

11:44

18/07/2021

Optimization Planning for 3D ConvNets

Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Tao Mei

Keywords: Applications, Activity and Event Recognition

5:13

14/09/2020

Incremental training of a recurrent neural network exploiting a multi-scale dynamic memory

Antonio Carta, Alessandro Sperduti, Davide Bacciu

Keywords: recurrent neural networks, linear dynamical systems, incremental learning

15:12

04/07/2020

Multiscale Collaborative Deep Models for Neural Machine Translation

Xiangpeng Wei, Heng Yu, Yue Hu, Yue Zhang, Rongxiang Weng, Weihua Luo

Keywords: Neural Translation, training models, IWSLT tasks, WMT14 task

10:42

06/12/2021

Video Instance Segmentation using Inter-Frame Communication Transformers

Sukjun Hwang, Miran Heo, Seoung Wug Oh, Seon Joo Kim

Keywords: transformers

10:00

06/12/2020

Auto Learning Attention

Benteng Ma, Jing Zhang, Yong Xia, Dacheng Tao

Keywords: Algorithms -> Representation Learning, Algorithms -> Relational Learning

3:16

06/12/2020

Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning

Woosuk Kwon, Gyeong-In Yu, Eunji Jeong, Byung-Gon Chun


3:23

14/06/2020

APQ: Joint Search for Network Architecture, Pruning and Quantization Policy

Tianzhe Wang, Kuan Wang, Han Cai, Ji Lin, Zhijian Liu, Hanrui Wang, Yujun Lin, Song Han

Keywords: efficiency, model compression, joint design, neural architecture search, channel pruning, mixed-precision quantization

1:00

06/12/2020

Deep Transformers with Latent Depth

Xian Li, Asa Cooper Stickland, Yuqing Tang, Xiang Kong


3:17