Abstract:
Transformers have proven to be a successful model for a variety of tasks in
sequence modeling. However, computing the attention matrix, which is their key
component, has quadratic complexity with respect to the sequence length, thus
making them prohibitively expensive for long sequences. To address this, we
propose clustered attention, which, instead of computing the attention
for every query, groups queries into clusters and computes attention just for
the centroids. To further improve this approximation, we use the computed
clusters to identify the keys with the highest attention per query and compute
the exact key/query dot products. This results in a model with linear
complexity with respect to the sequence length for a fixed number of clusters.
We evaluate our approach on two automatic speech recognition datasets and show
that our model consistently outperforms vanilla transformers for a given
computational budget. Finally, we demonstrate that our model can approximate
arbitrarily complex attention distributions with a minimal number of clusters
by approximating a pretrained BERT model on the GLUE and SQuAD benchmarks with only
25 clusters and no loss in performance.
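To make the core idea concrete, below is a minimal sketch of centroid-based attention, assuming PyTorch, a plain k-means (Lloyd) clustering of the queries, and a single attention head. The function name, the clustering initialization, and the shapes are illustrative; the paper's actual method also hashes queries before clustering and refines the result with exact dot products for the top keys, both of which are omitted here.

```python
import torch

def clustered_attention(Q, K, V, num_clusters=25, iters=10):
    """Approximate softmax attention by attending with query-cluster centroids.

    Q: (N, d) queries, K: (M, d) keys, V: (M, d) values.
    Illustrative sketch only; the top-k refinement step is not included.
    """
    N, d = Q.shape
    # Group queries into clusters with a few Lloyd (k-means) iterations.
    centroids = Q[torch.randperm(N)[:num_clusters]].clone()
    for _ in range(iters):
        assign = torch.cdist(Q, centroids).argmin(dim=1)       # (N,) cluster id per query
        for c in range(num_clusters):
            members = Q[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(dim=0)
    # Compute attention once per centroid instead of once per query:
    # cost is O(C * M) rather than O(N * M).
    attn = torch.softmax(centroids @ K.T / d ** 0.5, dim=-1)   # (C, M)
    out_per_cluster = attn @ V                                  # (C, d)
    # Broadcast each centroid's output to every query in its cluster.
    return out_per_cluster[assign]                              # (N, d)

# Example usage with random tensors.
Q, K, V = (torch.randn(512, 64) for _ in range(3))
print(clustered_attention(Q, K, V).shape)  # torch.Size([512, 64])
```

For a fixed number of clusters C, the attention cost grows linearly with the sequence length, which is the source of the linear complexity claimed above.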