How fine can fine-tuning be? Learning efficient language models

26/08/2020

Evani Radiya-Dixit, Xin Wang

Abstract: State-of-the-art performance on language understanding tasks is now achieved with increasingly large networks; the current record holder has billions of parameters. Given a language model pre-trained on massive unlabeled text corpora, only very light supervised fine-tuning is needed to learn a task: the number of fine-tuning steps is typically five orders of magnitude lower than the total parameter count. Does this mean that fine-tuning only introduces small differences from the pre-trained model in the parameter space? If so, can one avoid storing and computing an entire model for each task? In this work, we address these questions by using Bidirectional Encoder Representations from Transformers (BERT) as an example. As expected, we find that the fine-tuned models are close in parameter space to the pre-trained one, with the closeness varying from layer to layer. We show that it suffices to fine-tune only the most critical layers. Further, we find that there are surprisingly many good solutions in the set of sparsified versions of the pre-trained model. As a result, fine-tuning of huge language models can be achieved by simply setting a certain number of entries in certain layers of the pre-trained parameters to zero, saving both task-specific parameter storage and computational cost.
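The two ideas in the abstract, per-layer closeness in parameter space and fine-tuning by zeroing entries of the pre-trained weights, can be illustrated with a small NumPy sketch. This is only a toy under stated assumptions, not the authors' procedure: the layer names, the random "weights", the relative L2 distance as the closeness measure, and the randomly drawn mask are all hypothetical, and the abstract does not say how good masks are actually found.

```python
import numpy as np


def layer_distances(pretrained, finetuned):
    """Relative L2 distance between two parameter dicts, per layer."""
    return {
        name: float(np.linalg.norm(finetuned[name] - w) / np.linalg.norm(w))
        for name, w in pretrained.items()
    }


def apply_masks(pretrained, masks):
    """Build a task model by zeroing pre-trained entries where the mask is False.

    Layers without a mask are reused unchanged, so per task only the
    binary masks need to be stored.
    """
    return {
        name: np.where(masks[name], w, 0.0) if name in masks else w
        for name, w in pretrained.items()
    }


rng = np.random.default_rng(0)

# Hypothetical stand-in for a pre-trained model: three small weight matrices.
pretrained = {f"layer{i}": rng.normal(size=(4, 4)) for i in range(3)}

# Toy "fine-tuned" model: a small perturbation that grows with depth.
finetuned = {
    name: w + 0.01 * (i + 1) * rng.normal(size=w.shape)
    for i, (name, w) in enumerate(pretrained.items())
}
distances = layer_distances(pretrained, finetuned)

# Hypothetical task mask: sparsify only the last layer, keeping most entries.
masks = {"layer2": rng.random((4, 4)) > 0.1}
task_model = apply_masks(pretrained, masks)

for name, d in distances.items():
    print(f"{name}: relative distance {d:.4f}")
```

In this sketch the task-specific state is one boolean per entry of the masked layers, which is where the storage saving mentioned in the abstract would come from; the untouched layers are shared across tasks.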

The talk and the respective paper were published at the AISTATS 2020 virtual conference.

Similar Papers

06/12/2020

Modular Meta-Learning with Shrinkage

Yutian Chen, Abe Friesen, Feryal Behbahani, Arnaud Doucet, David Budden, Matthew Hoffman, Nando de Freitas

3:21

19/08/2021

Automatic Mixed-Precision Quantization Search of BERT

Changsheng Zhao, Ting Hua, Yilin Shen, Qian Lou, Hongxia Jin

Keywords: Machine Learning, Deep Learning, NLP Applications and Tools, Text Classification

12:12

03/05/2021

Parameter Efficient Multimodal Transformers for Video Representation Learning

Sangho Lee, Youngjae Yu, Gunhee Kim, Thomas Breuel, Jan Kautz, Yale Song

Keywords: Self-supervised learning, audio-visual representation learning, video representation learning

5:02

16/11/2020

Calibration of Pre-trained Transformers

Shrey Desai, Greg Durrett

Keywords: natural language processing, natural language inference, paraphrase detection, commonsense reasoning

6:55

25/07/2020

A pairwise probe for understanding BERT fine-tuning on machine reading comprehension

Jie Cai, Zhengzhou Zhu, Ping Nie, Qian Liu

Keywords: machine reading comprehension, pairwise, fine-tune, BERT

6:38

16/11/2020

Incremental Processing in the Age of Non-Incremental Encoders: An Empirical Assessment of Bidirectional Models for Incremental NLU

Brielen Madureira, David Schlangen

Keywords: nlp, interactive systems, language encoders, bidirectional lstms

10:04

06/12/2020

Incorporating BERT into Parallel Sequence Decoding with Adapters

Junliang Guo, Zhirui Zhang, Linli Xu, Hao-Ran Wei, Boxing Chen, Enhong Chen

3:17

06/12/2021

Shapeshifter: a Parameter-efficient Transformer using Factorized Reshaped Matrices

Aliakbar Panahi, Seyran Saeedi, Tom Arodz

Keywords: transformers

13:06

26/04/2020

Reducing Transformer Depth on Demand with Structured Dropout

Angela Fan, Edouard Grave, Armand Joulin

Keywords: reduction, regularization, pruning, dropout, transformer

5:01

06/12/2020

MPNet: Masked and Permuted Pre-training for Language Understanding

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu

3:23

19/04/2021

A neural few-shot text classification reality check

Thomas Dopierre, Christophe Gravier, Wilfried Logerais

9:47

16/11/2020

Towards Reasonably-Sized Character-Level Transformer NMT by Finetuning Subword Systems

Jindřich Libovický, Alexander Fraser

Keywords: transformer architecture, segmentation, subword model, neural model

6:28

18/07/2021

Accurate Post Training Quantization With Small Calibration Sets

Itay Hubara, Yury Nahshan, Yair Hanani, Ron Banner, Daniel Soudry

Keywords: Algorithms, AutoML

5:16

06/12/2020

Top-KAST: Top-K Always Sparse Training

Sid Jayakumar, Razvan Pascanu, Jack Rae, Simon Osindero, Erich Elsen

3:18

04/07/2020

GAN-BERT: Generative Adversarial Learning for Robust Text Classification with a Bunch of Labeled Examples

Danilo Croce, Giuseppe Castellucci, Roberto Basili

Keywords: Robust Classification, Natural tasks, image processing, generative setting

6:48

02/02/2021

Improving the Efficiency and Effectiveness for BERT-based Entity Resolution

Bing Li, Yukai Miao, Yaoshu Wang, Yifang Sun, Wei Wang

14:53

06/12/2021

PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition

Cheng-I Jeff Lai, Yang Zhang, Alexander Liu, Shiyu Chang, Yi-Lun Liao, Yung-Sung Chuang, Kaizhi Qian, Sameer Khurana, David Cox, Jim Glass

Keywords: self-supervised learning, representation learning

13:57

04/07/2020

Theoretical Limitations of Self-Attention in Neural Sequence Models

Michael Hahn

Keywords: NLP, Self-Attention Models, Neural Models, Transformers

14:02

18/07/2021

LogME: Practical Assessment of Pre-trained Models for Transfer Learning

Kaichao You, Yong Liu, Jianmin Wang, Mingsheng Long

Keywords: Algorithms, Multitask, Transfer, and Meta Learning

5:18

06/12/2021

Compacter: Efficient Low-Rank Hypercomplex Adapter Layers

Rabeeh Karimi Mahabadi, James Henderson, Sebastian Ruder

Keywords: optimization

14:16

06/12/2020

DynaBERT: Dynamic BERT with Adaptive Width and Depth

Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, Qun Liu

2:59

06/12/2020

Unsupervised Data Augmentation for Consistency Training

Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, Quoc V Le

3:29

06/12/2020

Semi-Supervised Neural Architecture Search

Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Enhong Chen, Tie-Yan Liu

3:20

06/12/2020

ConvBERT: Improving BERT with Span-based Dynamic Convolution

Zi-Hang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan

3:20

16/11/2020

From Zero to Hero: On the Limitations of Zero-Shot Language Transfer with Multilingual Transformers

Anne Lauscher, Vinit Ravishankar, Ivan Vulić, Goran Glavaš

Keywords: zero-shot transfer, downstream transfer, resource-lean scenarios, pos tagging

11:45

05/12/2020

Towards non-task-specific distillation of BERT via sentence representation approximation

Bowen Wu, Huan Zhang, MengYuan Li, Zongsheng Wang, Qihang Feng, Junhong Huang, Baoxun Wang

10:51

06/12/2021

Searching for Efficient Transformers for Language Modeling

David So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc V Le

Keywords: transformers, language

13:29

03/05/2021

Random Feature Attention

Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, Lingpeng Kong

Keywords: machine translation, transformers, language modeling, Attention

10:20

18/07/2021

Composed Fine-Tuning: Freezing Pre-Trained Denoising Autoencoders for Improved Generalization

Sang Michael Xie, Tengyu Ma, Percy Liang

Keywords: Algorithms, Multitask, Transfer, and Meta Learning

22:15

06/12/2020

Unsupervised Translation of Programming Languages

Baptiste Roziere, Marie-Anne Lachaux, Lowik Chanussot, Guillaume Lample

3:17

18/07/2021

Straight to the Gradient: Learning to Use Novel Tokens for Neural Text Generation

Xiang Lin, Simeng Han, Shafiq Joty

Keywords: Applications, Natural Language Processing

16:00

06/12/2021

Combiner: Full Attention Transformer with Sparse Computation Cost

Hongyu Ren, Hanjun Dai, Zihang Dai, Mengjiao Yang, Jure Leskovec, Dale Schuurmans, Bo Dai

Keywords: transformers

14:31

05/12/2020

An exploratory study on multilingual quality estimation

Shuo Sun, Marina Fomicheva, Frédéric Blain, Vishrav Chaudhary, Ahmed El-Kishky, Adithya Renduchintala, Francisco Guzmán, Lucia Specia

14:31

06/12/2021

Scatterbrain: Unifying Sparse and Low-rank Attention

Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, Christopher Ré

Keywords: transformers, generative model

13:15

19/10/2020

TwinBERT: Distilling knowledge to twin-structured compressed BERT models for large-scale retrieval

Wenhao Lu, Jian Jiao, Ruofei Zhang

Keywords: knowledge distillation, semantic embedding, sponsored search, bert, information retrieval, deep neural network, deep learning

10:20

16/11/2020

TernaryBERT: Distillation-aware Ultra-low Bit BERT

Wei Zhang, Lu Hou, Yichun Yin, Lifeng Shang, Xiao Chen, Xin Jiang, Qun Liu

Keywords: natural tasks, training process, transformer-based models, bert

8:41

04/07/2020

Bootstrapping Techniques for Polysynthetic Morphological Analysis

William Lane, Steven Bird

Keywords: Polysynthetic Analysis, Bootstrapping Techniques, natural technologies, linguistically-informed approaches

12:12

16/11/2020

Language Model Prior for Low-Resource Neural Machine Translation

Christos Baziotis, Barry Haddow, Alexandra Birch

Keywords: neural translation, neural tm, knowledge distillation, training time

11:16

04/07/2020

Do you have the right scissors? Tailoring Pre-trained Language Models via Monte-Carlo Methods

Ning Miao, Yuxuan Song, Hao Zhou, Lei Li

Keywords: over- problem, text tasks, Tailoring Models, Monte-Carlo Methods

5:30

14/06/2020

Resolution Adaptive Networks for Efficient Inference

Le Yang, Yizeng Han, Xi Chen, Shiji Song, Jifeng Dai, Gao Huang

Keywords: adaptive inference, efficient deep learning, multi-scale feature learning, budgeted batch classification

0:59