Sparse is Enough in Scaling Transformers

Abstract: Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study become out of reach. We address this problem by leveraging sparsity. We study sparse variants for all layers in the Transformer and propose Scaling Transformers, a family of next-generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer as we scale up the model size. Surprisingly, the sparse layers are enough to obtain the same perplexity as the standard Transformer. We also integrate with prior sparsity approaches to enable fast inference on long sequences even with limited memory, resulting in performance competitive with the state of the art on long text summarization.

12/07/2020

Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Łukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, Jonni Kanerva

Similar Papers

Train Big, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, Joseph Gonzalez

Keywords: Applications - Language, Speech and Dialog

GPU-Accelerated Primal Learning for Extremely Fast Large-Scale Classification

John Halloran, David M Rocke

Minimizing FLOPs to Learn Efficient Sparse Representations

Biswajit Paria, Chih-Kuan Yeh, Ian E.H. Yen, Ning Xu, Pradeep Ravikumar, Barnabás Póczos

Keywords: sparse embeddings, deep representations, metric learning, regularization

Top-KAST: Top-K Always Sparse Training

Sid Jayakumar, Razvan Pascanu, Jack Rae, Simon Osindero, Erich Elsen

Shapeshifter: a Parameter-efficient Transformer using Factorized Reshaped Matrices

Aliakbar Panahi, Seyran Saeedi, Tom Arodz

Keywords: transformers

Incremental Sensitivity Analysis for Kernelized Models

Hadar Sivan, Moshe Gabel, Assaf Schuster

Approximate Cross-Validation with Low-Rank Data in High Dimensions

Will Stephenson, Madeleine Udell, Tamara Broderick

Error Estimation for Sketched SVD

Miles Lopes, N. Benjamin Erichson, Michael Mahoney

Keywords: Probabilistic Inference - Approximate, Monte Carlo, and Spectral Methods

Paying more Attention to Snapshots of Iterative Pruning: Improving Model Compression via Ensemble Distillation

Duong Le, Nhan Vo, Nam Thoai

Keywords: network pruning, knowledge distillation, ensemble learning

A Novel Sequential Coreset Method for Gradient Descent Algorithms

Jiawei Huang, Ruomin Huang, Wenjie Liu, Nikolaos Freris, Hu Ding

Keywords: Optimization

SOLAR: Sparse Orthogonal Learned and Random Embeddings

Tharun Medini, Beidi Chen, Anshumali Shrivastava

Keywords: Embedding Models, Learning to Hash, Inverted Index, Sparse Embedding

Squeezing Correlated Neurons for Resource-Efficient Deep Neural Networks

Elbruz Ozen, Alex Orailoglu

Keywords: deep learning, information redundancy, pruning

Dynamic Model Pruning with Feedback

Tao Lin, Sebastian U. Stich, Luis Barba, Daniil Dmitriev, Martin Jaggi

Keywords: network pruning, dynamic reparameterization, model compression

Faster & more reliable tuning of neural networks: Bayesian optimization with importance sampling

Setareh Ariafar, Zelda Mariet, Dana Brooks, Jennifer Dy, Jasper Snoek

Naive Feature Selection: Sparsity in Naive Bayes

Armin Askari, Alexandre d'Aspremont, Laurent El Ghaoui

Provably Efficient Online Hyperparameter Optimization with Population-Based Bandits

Jack Parker-Holder, Vu Nguyen, Stephen J Roberts

Learning fast and precise numerical analysis

Jingxuan He, Gagandeep Singh, Markus Püschel, Martin Vechev

Keywords: Abstract interpretation, Performance optimization, Machine learning, Numerical domains

Exponential convergence rates of classification errors on learning with SGD and random features

Shingo Yashima, Atsushi Nitanda, Taiji Suzuki

SimiGrad: Fine-Grained Adaptive Batching for Large Scale Training using Gradient Similarity Measurement

Heyang Qin, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He

Keywords: machine learning

SEDONA: Search for Decoupled Neural Networks toward Greedy Block-wise Learning

Myeongjang Pyeon, Jihwan Moon, Taeyoung Hahn, Gunhee Kim

Keywords: AutoML, Greedy Learning, Deep Learning, Neural Architecture Search

Robust Meta-learning for Mixed Linear Regression with Small Batches

Weihao Kong, Raghav Somani, Sham Kakade, Sewoong Oh

BulletTrain: Accelerating Robust Neural Network Training via Boundary Example Mining

Weizhe Hua, Yichi Zhang, Chuan Guo, Zhiru Zhang, G. Edward Suh
