06/12/2020

Limits to Depth Efficiencies of Self-Attention

Yoav Levine, Noam Wies, Or Sharir, Hofit Bata, Amnon Shashua

Abstract: Self-attention architectures, which are rapidly pushing the frontier in natural language processing, demonstrate a surprising depth-inefficient behavior: empirical signals indicate that increasing the internal representation dimension (network width) is just as useful as increasing the number of self-attention layers (network depth). In this paper, we theoretically study the interplay between depth and width in self-attention. We shed light on the root of the above phenomenon and establish two distinct parameter regimes of depth efficiency and inefficiency in self-attention. We invalidate the seemingly plausible hypothesis that widening is as effective as deepening for self-attention, and show that in fact stacking self-attention layers is so effective that it quickly saturates the capacity of the network width. Specifically, we pinpoint a "depth threshold" that is logarithmic in the network width: for networks of depth below the threshold, we establish a double-exponential depth-efficiency of the self-attention operation, while for depths above the threshold we show that depth-inefficiency kicks in. Our predictions accord with existing empirical ablations, and we further demonstrate the two depth-(in)efficiency regimes experimentally for common network depths of 6, 12, and 24. By identifying network width as a limiting factor, our analysis indicates that solutions for dramatically increasing the width can facilitate the next leap in self-attention expressivity.
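As a rough illustration of the abstract's threshold claim (this is not code from the paper; the logarithm base and the example widths below are assumptions chosen only for demonstration), the following Python sketch compares the common depths mentioned above (6, 12, 24) against a log-of-width depth threshold:

import math

# Illustrative sketch only: the paper establishes a depth threshold that is
# logarithmic in the network width, but the exact base and constants used
# here (base 3, no additive constant) are assumptions for demonstration.
def depth_threshold(width: int, base: float = 3.0) -> float:
    """Hypothetical depth threshold, ~ log_base(width)."""
    return math.log(width, base)

# Compare common Transformer depths against the threshold for a few
# representative (hypothetical) widths.
for width in (256, 768, 2048):
    threshold = depth_threshold(width)
    for depth in (6, 12, 24):
        regime = "depth-efficiency" if depth <= threshold else "depth-inefficiency"
        print(f"width={width:5d}  depth={depth:2d}  threshold~{threshold:.1f}  -> {regime}")

Under these assumed constants, depths below the computed threshold fall in the depth-efficiency regime and larger depths in the depth-inefficiency regime; the paper itself should be consulted for the precise form of the threshold.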

The talk and the paper were presented at the NeurIPS 2020 virtual conference.
