Shapeshifter: a Parameter-efficient Transformer using Factorized Reshaped Matrices

Abstract: Language models employ a very large number of trainable parameters. Despite being highly overparameterized, these networks often achieve good out-of-sample test performance on the original task and easily fine-tune to related tasks. Recent observations involving, for example, the intrinsic dimension of the objective landscape and the lottery ticket hypothesis indicate that training often actively involves only a small fraction of the parameter space. Thus, a question remains: how large does a parameter space need to be in the first place? Evidence from recent work on model compression, parameter sharing, factorized representations, and knowledge distillation increasingly shows that models can be made much smaller and still perform well. Here, we focus on factorized representations of matrices that underpin dense, embedding, and self-attention layers. We use a low-rank factorized representation of a reshaped and rearranged original matrix to achieve space-efficient and expressive linear layers. We prove that stacking such low-rank layers increases their expressiveness, providing theoretical understanding for their effectiveness in deep networks. In Transformer models, our approach leads to a more than ten-fold reduction in the total number of trainable parameters, including embedding, attention, and feed-forward layers, with little degradation in on-task performance. The approach operates out-of-the-box, replacing each parameter matrix with its compact equivalent while maintaining the architecture of the network.
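
The abstract does not spell out the construction, so the following is only a minimal sketch of what "a low-rank factorization of a reshaped and rearranged matrix" could look like. It assumes the standard rearrangement that turns a sum of Kronecker products into a low-rank matrix. The function names, block sizes, and rank are hypothetical and are not taken from the paper.

```python
import numpy as np


def rearrange(W, m1, n1, m2, n2):
    """Map an (m1*m2, n1*n2) matrix to (m1*n1, m2*n2); each row is one flattened (m2, n2) block."""
    return W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)


def undo_rearrange(R, m1, n1, m2, n2):
    """Inverse of rearrange: map (m1*n1, m2*n2) back to (m1*m2, n1*n2)."""
    return R.reshape(m1, n1, m2, n2).transpose(0, 2, 1, 3).reshape(m1 * m2, n1 * n2)


def factorized_weight(U, V, m1, n1, m2, n2):
    """Dense weight whose rearranged form is the rank-r product U @ V.T."""
    return undo_rearrange(U @ V.T, m1, n1, m2, n2)


# Hypothetical sizes (assumption): a 16 x 16 weight stored as two 16 x r factors.
m1 = n1 = m2 = n2 = 4
r = 2
rng = np.random.default_rng(0)
U = rng.standard_normal((m1 * n1, r))
V = rng.standard_normal((m2 * n2, r))

W = factorized_weight(U, V, m1, n1, m2, n2)

# The same matrix written as a sum of r Kronecker products.
W_kron = sum(np.kron(U[:, k].reshape(m1, n1), V[:, k].reshape(m2, n2)) for k in range(r))

dense_params = W.size              # entries of the full matrix
factor_params = U.size + V.size    # entries actually stored
print(np.allclose(W, W_kron), dense_params, factor_params, np.linalg.matrix_rank(W))
```

In this sketch the rearranged matrix has rank at most `r`, while `W` in its original shape can have a much higher rank. That is one way to read the abstract's claim that such layers remain expressive despite storing far fewer parameters.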

14/09/2020

Aliakbar Panahi, Seyran Saeedi, Tom Arodz

Similar Papers

Squeezing Correlated Neurons for Resource-Efficient Deep Neural Networks

Elbruz Ozen, Alex Orailoglu

Keywords: deep learning, information redundancy, pruning

Continuous Self-Attention Models with Neural ODE Networks

Jing Zhang, Peng Zhang, Baiwen Kong, Junqiu Wei, Xin Jiang

Top-KAST: Top-K Always Sparse Training

Sid Jayakumar, Razvan Pascanu, Jack Rae, Simon Osindero, Erich Elsen

Resolution Adaptive Networks for Efficient Inference

Le Yang, Yizeng Han, Xi Chen, Shiji Song, Jifeng Dai, Gao Huang

Keywords: adaptive inference, efficient deep learning, multi-scale feature learning, budgeted batch classification

Minimizing FLOPs to Learn Efficient Sparse Representations

Biswajit Paria, Chih-Kuan Yeh, Ian E.H. Yen, Ning Xu, Pradeep Ravikumar, Barnabás Póczos

Keywords: sparse embeddings, deep representations, metric learning, regularization

Fuzzy Tiling Activations: A Simple Approach to Learning Sparse Representations Online

Yangchen Pan, Kirby Banman, Martha White

Keywords: natural sparsity, reinforcement learning, fuzzy tiling activation function, sparse representation

Revisiting Locally Supervised Learning: an Alternative to End-to-end Training

Yulin Wang, Zanlin Ni, Shiji Song, Le Yang, Gao Huang

Keywords: deep learning, locally supervised training

Network Pruning by Greedy Subnetwork Selection

Mao Ye, Chengyue Gong, Lizhen Nie, Denny Zhou, Adam Klivans, Qiang Liu

Keywords: Deep Learning - General

Naive Feature Selection: Sparsity in Naive Bayes

Armin Askari, Alexandre d'Aspremont, Laurent El Ghaoui

Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies

Paul Pu Liang, Manzil Zaheer, Yuan Wang, Amr Ahmed

Keywords: text classification, recommendation systems, large vocabularies, sparse embeddings, language modeling

Structured Multi-Hashing for Model Compression

Elad Eban, Yair Movshovitz-Attias, Hao Wu, Mark Sandler, Andrew Poon, Yerlan Idelbayev, Miguel Á. Carreira-Perpiñán

Keywords: compression, weight hashing, on device

Spike-Thrift: Towards Energy-Efficient Deep Spiking Neural Networks by Limiting Spiking Activity via Attention-Guided Compression

Souvik Kundu, Gourav Datta, Massoud Pedram, Peter A. Beerel

ChipNet: Budget-Aware Pruning with Heaviside Continuous Approximations

Rishabh Tiwari, Udbhav Bamba, Arnav Chavan, Deepak Gupta

Keywords: Budget constraints, Budget-Aware Pruning, Structured Pruning, Sparsity Learning

Scatterbrain: Unifying Sparse and Low-rank Attention

Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, Christopher Ré

Keywords: transformers, generative model

Paying more Attention to Snapshots of Iterative Pruning: Improving Model Compression via Ensemble Distillation

Duong Le, Nhan Vo, Nam Thoai

Keywords: network pruning, knowledge distillation, ensemble learning

Quasi-Newton Solver for Robust Non-Rigid Registration

Yuxin Yao, Bailin Deng, Weiwei Xu, Juyong Zhang

Keywords: non-rigid registration, robust estimator, quasi-Newton, Welsch's function, MM algorithm, L-BFGS, deformation graph

Memory and Computation-Efficient Kernel SVM via Binary Embedding and Ternary Model Coefficients

Zijian Lei, Liang Lan

Efficient Training of Retrieval Models using Negative Cache

Erik Lindgren, Sashank Reddi, Ruiqi Guo, Sanjiv Kumar

Keywords: deep learning, machine learning

Aligned Structured Sparsity Learning for Efficient Image Super-Resolution

Yulun Zhang, Huan Wang, Can Qin, Yun Fu

Keywords: deep learning

Neural Pruning via Growing Regularization

Huan Wang, Can Qin, Yulun Zhang, Yun Fu

Keywords: deep neural network pruning, regularization, Hessian matrix, model compression

Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks

Mark Kurtz, Justin Kopinsky, Rati Gelashvili, Alexander Matveev, John Carr, Michael Goin, William Leiserson, Sage Moore, Nir Shavit, Dan Alistarh
