06/12/2021

Sub-Linear Memory: How to Make Performers SLiM

Valerii Likhosherstov, Krzysztof Choromanski, Jared Quincy Davis, Xingyou Song, Adrian Weller

Keywords: transformers

Abstract: Transformer architectures have become very popular, yet the original implementation requires $O(L^2)$ serial time and memory as functions of the input length $L$. Recent works have proposed various linear self-attention mechanisms, scaling only as $O(L)$ for serial computation. We conduct a thorough complexity analysis of Performers, a class that includes most recent linear Transformer mechanisms. We note a remarkable computational flexibility: the gradient computation can be performed with no approximations using sub-linear memory as a function of $L$ (in addition to negligible storage for the input sequence), at the cost of greater time complexity in the parallel setting. In the extreme case, a Performer consumes only $O(1)$ memory and still requires $O(L)$ time. Due to complete backward compatibility, this discovered time-memory tradeoff can be used for fine-tuning on low-memory devices in a decentralized fashion without any server computations.
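
To make the time-memory tradeoff concrete, below is a minimal sketch (not the authors' implementation) of causal, Performer-style linear attention processed in chunks over a running prefix-sum state, with each chunk rematerialized during the backward pass via `jax.checkpoint`. The function names `feature_map` and `chunked_linear_attention`, the ELU-based feature map, and the `chunk_size` parameter are illustrative assumptions.

```python
# Sketch of exact, chunked linear attention with rematerialization.
# Assumes L is divisible by chunk_size; feature_map stands in for the
# random-feature map phi used by Performers.
import jax
import jax.numpy as jnp

def feature_map(x):
    # Placeholder positive feature map (Performers use random features phi).
    return jax.nn.elu(x) + 1.0

def chunked_linear_attention(q, k, v, chunk_size):
    """Causal linear attention over L tokens, processed chunk by chunk.

    Activation memory is O(chunk_size * d) plus the O(d^2) prefix-sum state;
    the backward pass recomputes each chunk (jax.checkpoint), trading extra
    time for memory that is sub-linear in L.
    """
    L, d = q.shape
    qf, kf = feature_map(q), feature_map(k)

    def chunk_step(carry, inputs):
        qc, kc, vc = inputs               # one chunk, each (chunk_size, d)

        def inner(carry, xs):
            S, z = carry                  # S: (d, d) sum of phi(k_j) v_j^T, z: (d,) sum of phi(k_j)
            qi, ki, vi = xs
            S = S + jnp.outer(ki, vi)
            z = z + ki
            out = (qi @ S) / (qi @ z + 1e-6)
            return (S, z), out

        return jax.lax.scan(inner, carry, (qc, kc, vc))

    # Rematerialize each chunk's activations in the backward pass.
    chunk_step = jax.checkpoint(chunk_step)

    n_chunks = L // chunk_size
    split = lambda x: x.reshape(n_chunks, chunk_size, d)
    init = (jnp.zeros((d, d)), jnp.zeros((d,)))
    _, out = jax.lax.scan(chunk_step, init, (split(qf), split(kf), split(v)))
    return out.reshape(L, d)
```

In this sketch, shrinking `chunk_size` lowers peak activation memory toward the $O(1)$-state extreme at the cost of more serial steps and recomputation, while `chunk_size = L` recovers the usual linear-attention memory footprint; the output is identical in all cases, matching the paper's claim of an exact, backward-compatible tradeoff.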

The talk and the accompanying paper were presented at the NeurIPS 2021 virtual conference.
