Attention is not all you need: pure attention loses rank doubly exponentially with depth

18/07/2021

Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas

Keywords: Deep Learning, Architectures

Abstract: Attention-based architectures have become ubiquitous in machine learning, yet our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms (or paths), each involving the operation of a sequence of attention heads across layers. Using this path decomposition, we prove that self-attention possesses a strong inductive bias towards "token uniformity". Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially with depth to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degenerating. Our experiments verify the convergence results on standard transformer architectures.
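The "token uniformity" claim can be illustrated with a small numerical sketch. The code below is not the authors' experiment. It stacks single-head self-attention layers with no skip connections and no MLPs, and tracks how far apart the token rows remain. Several choices are assumptions made for illustration only: the value projection is taken to be the identity, the query and key weights are random Gaussians, and the helper `token_spread` (a hypothetical name) measures the largest per-feature gap between tokens rather than the norm used in the paper. Because each output row is then a convex combination of the input rows, this spread cannot grow, and it shrinks toward zero as all rows approach a common vector, which is the rank-1 limit. The sketch only shows the collapse itself, not the doubly exponential rate.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_layer(x, w_q, w_k):
    """One single-head self-attention layer: no skip connection, no MLP.

    Assumption: the value projection is the identity, so the output is P @ x,
    where P is the row-stochastic attention matrix.
    """
    d = len(x[0])
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), x)


def token_spread(x):
    """Largest per-feature gap between any two tokens (zero iff all rows are equal)."""
    return max(max(col) - min(col) for col in zip(*x))


def run(depth=8, n_tokens=6, dim=4, seed=0):
    """Apply `depth` pure attention layers and record the spread after each one."""
    rng = random.Random(seed)

    def rand(rows, cols, scale):
        return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

    x = rand(n_tokens, dim, 1.0)
    spreads = [token_spread(x)]
    for _ in range(depth):
        # Fresh random query/key weights per layer (illustrative, not trained).
        x = attention_layer(x, rand(dim, dim, 0.5), rand(dim, dim, 0.5))
        spreads.append(token_spread(x))
    return spreads


spreads = run()
for layer, s in enumerate(spreads):
    print(f"layer {layer}: spread = {s:.3e}")
```

Running the script prints one spread value per layer. Adding a skip connection (returning `x + P @ x` instead of `P @ x`) removes the convex-combination argument, which is one informal way to see why the abstract singles out skip connections as counteracting the collapse.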

This is an embedded video. The talk and the corresponding paper were published at the ICML 2021 virtual conference.

Similar Papers

06/12/2020

Enabling certification of verification-agnostic networks via memory-efficient semidefinite programming

Sumanth Dathathri, Krishnamurthy Dvijotham, Alexey Kurakin, Aditi Raghunathan, Jonathan Uesato, Rudy Bunel, Shreya Shankar, Jacob Steinhardt, Ian Goodfellow, Percy Liang, Pushmeet Kohli

3:23

18/07/2021

You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling

Zhanpeng Zeng, Yunyang Xiong, Sathya Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh

Keywords: Applications, Natural Language Processing

5:16

14/09/2020

Squeezing Correlated Neurons for Resource-Efficient Deep Neural Networks

Elbruz Ozen, Alex Orailoglu

Keywords: deep learning, information redundancy, pruning

14:48

06/12/2020

Untangling tradeoffs between recurrence and self-attention in artificial neural networks

Giancarlo Kerg, Bhargav Kanuparthi, Anirudh Goyal, Kyle Goyette, Yoshua Bengio, Guillaume Lajoie

3:20

26/04/2020

Continual learning with hypernetworks

Johannes von Oswald, Christian Henning, João Sacramento, Benjamin F. Grewe

Keywords: Continual Learning, Catastrophic Forgetting, Meta Model, Hypernetwork

5:04

03/05/2021

Fast and Complete: Enabling Complete Neural Network Verification with Rapid and Massively Parallel Incomplete Verifiers

Kaidi Xu, Huan Zhang, Shiqi Wang, Yihan Wang, Suman Jana, Xue Lin, Cho-Jui Hsieh

Keywords: branch and bound, neural network verification

5:08

26/04/2020

Minimizing FLOPs to Learn Efficient Sparse Representations

Biswajit Paria, Chih-Kuan Yeh, Ian E.H. Yen, Ning Xu, Pradeep Ravikumar, Barnabás Póczos

Keywords: sparse embeddings, deep representations, metric learning, regularization

4:41

06/12/2021

Efficient Training of Retrieval Models using Negative Cache

Erik Lindgren, Sashank Reddi, Ruiqi Guo, Sanjiv Kumar

Keywords: deep learning, machine learning

10:41

06/12/2020

Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, Armand Joulin

3:22

02/02/2021

Knowledge-aware Leap-LSTM: Integrating Prior Knowledge into Leap-LSTM towards Faster Long Text Classification

Jinhua Du, Yan Huang, Karo Moilanen

19:11

12/07/2020

Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks

Mark Kurtz, Justin Kopinsky, Rati Gelashvili, Alexander Matveev, John Carr, Michael Goin, William Leiserson, Sage Moore, Nir Shavit, Dan Alistarh

Keywords: Deep Learning - Algorithms

14:41

05/01/2021

Can Selfless Learning Improve Accuracy of a Single Classification Task?

Soumya Roy, Bharat Bhusan Sau

4:58

06/12/2020

Regularizing Towards Permutation Invariance In Recurrent Models

Edo Cohen-Karlik, Avichai Ben David, Amir Globerson

3:19

06/12/2021

Adaptive Proximal Gradient Methods for Structured Neural Networks

Jihun Yun, Aurelie Lozano, Eunho Yang

Keywords: deep learning, optimization, machine learning

10:46

26/04/2020

On the Convergence of FedAvg on Non-IID Data

Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, Zhihua Zhang

Keywords: Federated Learning, stochastic optimization, Federated Averaging

13:58

18/07/2021

Whitening and Second Order Optimization Both Make Information in the Dataset Unusable During Training, and Can Reduce or Prevent Generalization

Neha Wadia, Daniel Duckworth, Samuel Schoenholz, Ethan Dyer, Jascha Sohl-Dickstein

Keywords: Optimization, Probabilistic Methods, Topic Models, Latent Variable Models

5:17

12/07/2020

How to Solve Fair k-Center in Massive Data Models

Ashish Chiplunkar, Sagar Kale, Sivaramakrishnan Natarajan Ramamoorthy

Keywords: Fairness, Equity, Justice, and Safety

13:45

06/12/2020

Adaptive Discretization for Model-Based Reinforcement Learning

Sean Sinclair, Tianyu Wang, Gauri Jain, Sid Banerjee, Christina Yu

3:12

26/04/2020

Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML

Aniruddh Raghu, Maithra Raghu, Samy Bengio, Oriol Vinyals

Keywords: deep learning analysis, representation learning, meta-learning, few-shot learning

5:25

05/12/2020

Self-supervised learning for pairwise data refinement

Gustavo Hernandez Abrego, Bowen Liang, Wei Wang, Zarana Parekh, Yinfei Yang, Yunhsuan Sung

15:17

18/07/2021

EL-Attention: Memory Efficient Lossless Attention for Generation

Yu Yan, Jiusheng Chen, Weizhen Qi, Nikhil Bhendawade, Yeyun Gong, Nan Duan, Ruofei Zhang

Keywords: Applications, Natural Language Processing

4:50

06/12/2021

Towards Deeper Deep Reinforcement Learning with Spectral Normalization

Nils Bjorck, Carla Gomes, Kilian Weinberger

Keywords: reinforcement learning and planning, vision, language

9:28

06/12/2020

Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks

Soham De, Sam Smith

3:23

06/12/2021

Dense Unsupervised Learning for Video Segmentation

Nikita Araslanov, Simone Schaub-Meyer, Stefan Roth

Keywords: self-supervised learning, representation learning

13:34

06/12/2021

Delayed Gradient Averaging: Tolerate the Communication Latency for Federated Learning

Ligeng Zhu, Hongzhou Lin, Yao Lu, Yujun Lin, Song Han

Keywords: optimization, machine learning, federated learning

14:48

06/12/2020

Bayesian Attention Modules

Xinjie Fan, Shujian Zhang, Bo Chen, Mingyuan Zhou

3:32

06/12/2020

Sparse Spectrum Warped Input Measures for Nonstationary Kernel Learning

Anthony Tompkins, Rafael Oliveira, Fabio Ramos

3:20

03/08/2020

Lagrangian Decomposition for Neural Network Verification

Rudy Bunel, Alessandro De Palma, Alban Desmaison, Krishnamurthy Dvijotham, Pushmeet Kohli, Philip Torr, M. Pawan Kumar

8:05

14/06/2020

Recursive Least-Squares Estimator-Aided Online Learning for Visual Tracking

Jin Gao, Weiming Hu, Yan Lu

Keywords: online learning, visual tracking, continual learning, recursive least-squares estimation, deep learning, memory retention, recursive learning, mini-batch sgd, normal equation, mlp layer

5:01

18/07/2021

Robust Unsupervised Learning via L-statistic Minimization

Andreas Maurer, Daniela Angela Parletta, Andrea Paudice, Massimiliano Pontil

Keywords: Theory, Statistical Learning Theory

5:03

19/08/2021

Multi-hop Attention Graph Neural Networks

Guangtao Wang, Rex Ying, Jing Huang, Jure Leskovec

Keywords: Machine Learning, Deep Learning, Learning Graphical Models, Relational Learning

12:35

14/06/2020

Unsupervised Person Re-Identification via Softened Similarity Learning

Yutian Lin, Lingxi Xie, Yu Wu, Chenggang Yan, Qi Tian

Keywords: person re-identification, unsupervised learning

1:01

14/06/2020

S3VAE: Self-Supervised Sequential VAE for Representation Disentanglement and Data Generation

Yizhe Zhu, Martin Renqiang Min, Asim Kadav, Hans Peter Graf

Keywords: self-supervised, sequential vae, representation disentanglement, video generation, video manipulation

1:00

06/12/2020

Kernel Methods Through the Roof: Handling Billions of Points Efficiently

Giacomo Meanti, Luigi Carratino, Lorenzo Rosasco, Alessandro Rudi

3:28

05/01/2021

Self Supervision for Attention Networks

Badri N. Patro, Kasturi G.S., Ansh Jain, Vinay P. Namboodiri

5:01

02/02/2021

Post-training Quantization with Multiple Points: Mixed Precision without Mixed Precision

Xingchao Liu, Mao Ye, Dengyong Zhou, Qiang Liu

15:18

03/05/2021

Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation

Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, Noah Smith

Keywords: Machine Translation, Sequence Modeling, Natural Language Processing

5:04

26/04/2020

PC-DARTS: Partial Channel Connections for Memory-Efficient Architecture Search

Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Guo-Jun Qi, Qi Tian, Hongkai Xiong

Keywords: Neural Architecture Search, DARTS, Regularization, Normalization

4:40

18/07/2021

ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training

Jianfei Chen, Lianmin Zheng, Zhewei Yao, Dequan Wang, Ion Stoica, Michael Mahoney, Joseph E Gonzalez

Keywords: Algorithms, Large Scale Learning

18:54

05/01/2021

Deep Unsupervised Anomaly Detection

Tangqing Li, Zheng Wang, Siying Liu, Wen-Yan Lin

5:00