XtremeDistil: Multi-stage Distillation for Massive Multilingual Models

04/07/2020

Subhabrata Mukherjee, Ahmed Hassan Awadallah

Keywords: natural language processing tasks, knowledge distillation, multilingual Named Entity Recognition (NER)

Abstract: Deep and large pre-trained language models are the state of the art for various natural language processing tasks. However, the huge size of these models can be a deterrent to using them in practice. Some recent works use knowledge distillation to compress these huge models into shallow ones. In this work, we study knowledge distillation with a focus on multilingual Named Entity Recognition (NER). In particular, we study several distillation strategies and propose a stage-wise optimization scheme that leverages the teacher's internal representations and is agnostic of the teacher architecture, and we show that it outperforms strategies employed in prior works. Additionally, we investigate the role of several factors, including the amount of unlabeled data, annotation resources, model architecture, and inference latency. We show that our approach leads to massive compression of teacher models like mBERT, by up to 35x in terms of parameters and 51x in terms of latency for batch inference, while retaining 95% of its F1-score for NER over 41 languages.
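
The abstract does not spell out the stage-wise scheme, so the following is only a rough sketch of what stage-wise distillation could look like, not the authors' implementation. The three stages, their order, the choice of mean squared error and cross-entropy, and the names `stage_loss`, `hidden`, and `logits` are all assumptions made for illustration. It also assumes the student's hidden vector has already been projected to the teacher's dimension.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cross_entropy(logits, label):
    """Cross-entropy of a softmax over `logits` against an integer label."""
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[label]


def stage_loss(stage, student, teacher=None, label=None):
    """Loss for one hypothetical stage of stage-wise distillation.

    Stage 1: match the teacher's internal representation (no labels needed).
    Stage 2: match the teacher's output logits (no labels needed).
    Stage 3: fit the gold label on annotated data.
    """
    if stage == 1:
        return mse(student["hidden"], teacher["hidden"])
    if stage == 2:
        return mse(student["logits"], teacher["logits"])
    if stage == 3:
        return cross_entropy(student["logits"], label)
    raise ValueError(f"unknown stage: {stage}")


# Toy values for a single token; real models would produce these tensors.
teacher = {"hidden": [0.5, -1.0, 2.0], "logits": [2.0, 0.0]}
student = {"hidden": [0.5, -1.0, 1.0], "logits": [1.0, 0.0]}

for stage in (1, 2, 3):
    print(stage, round(stage_loss(stage, student, teacher, label=0), 4))
```

In a sketch like this, the first two stages need only teacher outputs, so they could run on unlabeled text, while the last stage uses the available annotations.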

The talk and the corresponding paper were published at the ACL 2020 virtual conference.

Similar Papers

06/12/2020

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou

3:21

02/02/2021

Reinforced Multi-Teacher Selection for Knowledge Distillation

Fei Yuan, Linjun Shou, Jian Pei, Wutao Lin, Ming Gong, Yan Fu, Daxin Jiang

14:18

16/11/2020

Contrastive Distillation on Intermediate Representations for Language Model Compression

Siqi Sun, Zhe Gan, Yuwei Fang, Yu Cheng, Shuohang Wang, Jingjing Liu

Keywords: contrastive distillation, compress models, pre-training stages, existing methods

8:19

02/02/2021

Progressive Network Grafting for Few-Shot Knowledge Distillation

Chengchao Shen, Xinchao Wang, Youtan Yin, Jie Song, Sihui Luo, Mingli Song

9:23

08/12/2020

Collective Wisdom: Improving Low-resource Neural Machine Translation using Adaptive Knowledge Distillation

Fahimeh Saleh, Wray Buntine, Gholamreza Haffari

9:03

16/11/2020

BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth Mover's Distance

Jianquan Li, Xiaokang Liu, Honghong Zhao, Ruifeng Xu, Min Yang, Yaohong Jin

Keywords: natural tasks, nlp tasks, matching, many-to-many mapping

11:58

14/06/2020

Distilling Cross-Task Knowledge via Relationship Matching

Han-Jia Ye, Su Lu, De-Chuan Zhan

Keywords: knowledge distillation, model reuse, knowledge transfer, cross-task learning, embedding learning

4:54

16/11/2020

Adversarial Self-Supervised Data-Free Distillation for Text Classification

Xinyin Ma, Yongliang Shen, Gongfan Fang, Chen Chen, Chenghao Jia, Weiming Lu

Keywords: nlp tasks, nlp, compressing models, text generation

9:36

02/02/2021

Learning to Augment for Data-scarce Domain BERT Knowledge Distillation

Lingyun Feng, Minghui Qiu, Yaliang Li, Hai-Tao Zheng, Ying Shen

17:11

16/11/2020

Autoregressive Knowledge Distillation through Imitation Learning

Alexander Lin, Jeremy Wohlwend, Howard Chen, Tao Lei

Keywords: natural tasks, knowledge distillation, exposure problem, prototypical tasks

12:43

06/12/2021

Dynamic Distillation Network for Cross-Domain Few-Shot Recognition with Unlabeled Data

Ashraful Islam, Chun-Fu (Richard) Chen, Rameswar Panda, Leonid Karlinsky, Rogerio Feris, Richard J. Radke

Keywords: machine learning, meta learning, few shot learning

10:10

04/07/2020

Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language

Qianhui Wu, Zijia Lin, Börje Karlsson, Jian-Guang Lou, Biqing Huang

Keywords: Single-/Multi-Source NER, named problem, cross-lingual NER, single-source NER

10:54

22/11/2021

Object Re-identification Using Teacher-Like and Light Students

Yi Xie, Hanxiao Wu, Fei Shen, Jianqing Zhu, Huanqiang Zeng

Keywords: object re-identification, knowledge distillation, pruning, re-parameterization

3:19

19/04/2021

Annealing knowledge distillation

Aref Jafari, Mehdi Rezagholizadeh, Pranav Sharma, Ali Ghodsi

12:38

03/05/2021

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

Keywords: end-to-end, non-autoregressive generation, speech synthesis, one-to-many mapping, text to speech

7:01

19/08/2021

Object Detection in Densely Packed Scenes via Semi-Supervised Learning with Dual Consistency

Chao Ye, Huaidong Zhang, Xuemiao Xu, Weiwei Cai, Jing Qin, Kup-Sze Choi

Keywords: Computer Vision, Recognition, Deep Learning, Semi-Supervised Learning

10:19

02/02/2021

Knowledge Refinery: Learning from Decoupled Label

Qianggang Ding, Sifan Wu, Tao Dai, Hao Sun, Jiadong Guo, Zhang-Hua Fu, Shutao Xia

12:33

30/11/2020

Compensating for the Lack of Extra Training Data by Learning Extra Representation

Hyeonseong Jeon, Siho Han, Sangwon Lee, Simon S. Woo

9:19

16/11/2020

TeaForN: Teacher-Forcing with N-grams

Sebastian Goodman, Nan Ding, Radu Soricut

Keywords: machine benchmark, news benchmarks, sequence models, teacher-forcing

12:02

22/11/2021

PDF-Distil: including Prediction Disagreements in Feature-based Distillation for object detection

Heng ZHANG, Elisa Fromont, Sébastien Lefèvre, Bruno AVIGNON

Keywords: knowledge distillation, object detection

2:57

22/11/2021

Class-Balanced Distillation for Long-Tailed Visual Recognition

Ahmet Iscen, Andre Araujo, Boqing Gong, Cordelia Schmid

Keywords: Long tailed recognition, dataset imbalance

3:02

04/07/2020

MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices

Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou

Keywords: Natural NLP, NLP tasks, knowledge transfer, natural tasks

11:10

02/02/2021

Improving Tree-Structured Decoder Training for Code Generation via Mutual Learning

Binbin Xie, Jinsong Su, Yubin Ge, Xiang Li, Jianwei Cui, Junfeng Yao, Bin Wang

15:57

02/02/2021

Show, Attend and Distill: Knowledge Distillation via Attention-based Feature Matching

Mingi Ji, Byeongho Heo, Sungrae Park

14:18

18/07/2021

AlphaNet: Improved Training of Supernets with Alpha-Divergence

Dilin Wang, Chengyue Gong, Meng Li, Qiang Liu, Vikas Chandra

Keywords: Deep Learning, Architectures

16:14

06/12/2021

Comprehensive Knowledge Distillation with Causal Intervention

Xiang Deng, Zhongfei Zhang

Keywords: representation learning, causality

12:24

02/02/2021

Asynchronous Teacher Guided Bit-wise Hard Mining for Online Hashing

Sheng Jin, Qin Zhou, Hongxun Yao, Yao Liu, Xian-Sheng Hua

17:01

06/12/2021

VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer

Zineng Tang, Jaemin Cho, Hao Tan, Mohit Bansal

Keywords: language

10:13

03/05/2021

MixKD: Towards Efficient Distillation of Large-scale Language Models

Kevin Liang, Weituo Hao, Dinghan Shen, Yufan Zhou, Weizhu Chen, Changyou Chen, Lawrence Carin

Keywords: Representation Learning, Natural Language Processing

3:52

06/12/2021

Exponential Separation between Two Learning Models and Adversarial Robustness

Grzegorz Gluch, Ruediger Urbanke

Keywords: theory, robustness, adversarial robustness and security

15:11

06/12/2021

Unsupervised Representation Transfer for Small Networks: I Believe I Can Distill On-the-Fly

Hee Min Choi, Hyoa Kang, Dokwan Oh

Keywords: self-supervised learning, representation learning

3:35

12/07/2020

Divide and Conquer: Leveraging Intermediate Feature Representations for Quantized Training of Neural Networks

Ahmed T. Elthakeb, Prannoy Pilligundla, FatemehSadat Mireshghallah, Alexander Cloninger, Hadi Esmaeilzadeh

Keywords: Applications - Other

14:29

22/11/2021

Beyond Classification: Knowledge Distillation using Multi-Object Impressions

Gaurav Kumar Nayak, Monish K Keswani, Sharan Seshadri, Anirban Chakraborty

Keywords: Knowledge Distillation (KD), zero-shot, data-free, object detection, data privacy, multi-object impressions, pseudo-data, pseudo-targets, synthetic data, Faster RCNN

3:06

22/11/2021

Multi-bit Adaptive Distillation for Binary Neural Networks

Ying Nie, Kai Han, Yunhe Wang

Keywords: binary, distillation, 1bit

1:52

19/08/2021

Graph Consistency Based Mean-Teaching for Unsupervised Domain Adaptive Person Re-Identification

Xiaobin Liu, Shiliang Zhang

Keywords: Computer Vision, Recognition, Applications of Unsupervised Learning

11:05

18/07/2021

Efficient Iterative Amortized Inference for Learning Symmetric and Disentangled Multi-Object Representations

Patrick Emami, Pan He, Sanjay Ranka, Anand Rangarajan

Keywords: Deep Learning, Embedding and Representation learning

5:10

03/05/2021

SEED: Self-supervised Distillation For Visual Representation

Jacob Zhiyuan Fang, Jianfeng Wang, Lijuan Wang, Lei Zhang, 'YZ' Yezhou Yang, Zicheng Liu

Keywords: Representation Learning, Self Supervised Learning, Knowledge Distillation

5:09

03/05/2021

Knowledge distillation via softmax regression representation learning

Jing Yang, Brais Martinez, Adrian Bulat, Georgios Tzimiropoulos

4:56

02/02/2021

ALP-KD: Attention-Based Layer Projection for Knowledge Distillation

Peyman Passban, Yimeng Wu, Mehdi Rezagholizadeh, Qun Liu

18:53

14/06/2020

Neural Networks Are More Productive Teachers Than Human Raters: Active Mixup for Data-Efficient Knowledge Distillation From a Blackbox Model

Dongdong Wang, Yandong Li, Liqiang Wang, Boqing Gong

Keywords: blackbox knowledge distillation, data-efficient learning, active learning, mixup

4:59