BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth Mover's Distance

16/11/2020

BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth Mover's Distance

Jianquan Li, Xiaokang Liu, Honghong Zhao, Ruifeng Xu, Min Yang, Yaohong Jin

Keywords: natural tasks, nlp tasks, matching, many-to-many mapping

Abstract Paper Similar Papers

Abstract: Pre-trained language models (e.g., BERT) have achieved significant success in various natural language processing (NLP) tasks. However, high storage and computational costs obstruct pre-trained language models to be effectively deployed on resource-constrained devices. In this paper, we propose a novel BERT distillation method based on many-to-many layer mapping, which allows each intermediate student layer to learn from any intermediate teacher layers. In this way, our model can learn from different teacher layers adaptively for different NLP tasks. In addition, we leverage Earth Mover′s Distance (EMD) to compute the minimum cumulative cost that must be paid to transform knowledge from teacher network to student network. EMD enables effective matching for the many-to-many layer mapping. Furthermore, we propose a cost attention mechanism to learn the layer weights used in EMD automatically, which is supposed to further improve the model′s performance and accelerate convergence time. Extensive experiments on GLUE benchmark demonstrate that our model achieves competitive performance compared to strong competitors in terms of both accuracy and model compression

0

0

0

0

Share

This is an embedded video. Talk and the respective paper are published at EMNLP 2020 virtual conference. If you are one of the authors of the paper and want to manage your upload, see the question "My papertalk has been externally embedded..." in the FAQ section.

Comments

Post Comment

no comments yet

Similar Papers

06/12/2020

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

Wenhui Wang, Furu Wei, Li Dong and
Hangbo Bao, Nan Yang, Ming Zhou

Keywords Paper

0

0

0

0

3:21

02/02/2021

Learning to Augment for Data-scarce Domain BERT Knowledge Distillation

Lingyun Feng, Minghui Qiu, Yaliang Li and
Hai-Tao Zheng, Ying Shen

Keywords Paper

0

0

0

0

17:11

02/02/2021

Cross-Layer Distillation with Semantic Calibration

Defang Chen, Jian-Ping Mei, Yuan Zhang and
Can Wang, Zhe Wang, Yan Feng, Chun Chen

Keywords Paper

0

0

0

0

17:05

02/02/2021

Improving Tree-Structured Decoder Training for Code Generation via Mutual Learning

Binbin Xie, Jinsong Su, Yubin Ge and
Xiang Li, Jianwei Cui, Junfeng Yao, Bin Wang

Keywords Paper

0

0

0

0

15:57

02/02/2021

A Student-Teacher Architecture for Dialog Domain Adaptation Under the Meta-Learning Setting

Kun Qian, Wei Wei, Zhou Yu

Keywords Paper

0

0

0

0

14:46

02/02/2021

Reinforced Multi-Teacher Selection for Knowledge Distillation

Fei Yuan, Linjun Shou, Jian Pei and
Wutao Lin, Ming Gong, Yan Fu, Daxin Jiang

Keywords Paper

0

0

0

0

14:18

18/07/2021

AlphaNet: Improved Training of Supernets with Alpha-Divergence

Dilin Wang, Chengyue Gong, Meng Li and
Qiang Liu, Vikas Chandra

Keywords Paper

Deep Learning, Architectures

0

0

0

0

16:14

02/02/2021

Progressive Network Grafting for Few-Shot Knowledge Distillation

Chengchao Shen, Xinchao Wang, Youtan Yin and
Jie Song, Sihui Luo, Mingli Song

Keywords Paper

0

0

0

0

9:23

06/12/2021

Adversarial Teacher-Student Representation Learning for Domain Generalization

Fu-En Yang, Yuan-Chia Cheng, Zu-Yun Shiau, Yu-Chiang Frank Wang

Keywords Paper

machine learning, vision, domain adaptation, representation learning

0

0

0

0

8:14

06/12/2021

Unsupervised Representation Transfer for Small Networks: I Believe I Can Distill On-the-Fly

Hee Min Choi, Hyoa Kang, Dokwan Oh

Keywords Paper

self-supervised learning, representation learning

0

0

0

0

3:35

19/04/2021

Annealing knowledge distillation

Aref Jafari, Mehdi Rezagholizadeh, Pranav Sharma, Ali Ghodsi

Keywords Paper

0

0

0

0

12:38

04/07/2020

XtremeDistil: Multi-stage Distillation for Massive Multilingual Models

Subhabrata Mukherjee, Ahmed Hassan Awadallah

Keywords Paper

natural tasks, knowledge distillation, multilingual Recognition, multilingual NER

0

0

0

0

10:58

08/12/2020

Collective Wisdom: Improving Low-resource Neural Machine Translation using Adaptive Knowledge Distillation

Fahimeh Saleh, Wray Buntine, Gholamreza Haffari

Keywords Paper

0

0

0

0

9:03

02/02/2021

Show, Attend and Distill: Knowledge Distillation via Attention-based Feature Matching

Mingi Ji, Byeongho Heo, Sungrae Park

Keywords Paper

0

0

0

0

14:18

19/08/2021

Object Detection in Densely Packed Scenes via Semi-Supervised Learning with Dual Consistency

Chao Ye, Huaidong Zhang, Xuemiao Xu and
Weiwei Cai, Jing Qin, Kup-Sze Choi

Keywords Paper

Computer Vision, Recognition, Deep Learning, Semi-Supervised Learning

0

0

0

0

10:19

03/08/2020

Mutual Information Based Knowledge Transfer Under State-Action Dimension Mismatch

Michael Wan, Tanmay Gangwani, Jian Peng

Keywords Paper

0

0

0

0

6:27

14/06/2020

Search to Distill: Pearls Are Everywhere but Not the Eyes

Yu Liu, Xuhui Jia, Mingxing Tan and
Raviteja Vemulapalli, Yukun Zhu, Bradley Green, Xiaogang Wang

Keywords Paper

neural architecture search, knowledge distillation, nas, neural architecture

0

0

0

0

4:26

14/06/2020

Knowledge As Priors: Cross-Modal Knowledge Generalization for Datasets Without Superior Knowledge

Long Zhao, Xi Peng, Yuxiao Chen and
Mubbasir Kapadia, Dimitris N. Metaxas

Keywords Paper

knowledge distillation, transfer learning, meta-learning, 3d hand pose estimation

0

0

0

0

1:01

16/11/2020

Adversarial Self-Supervised Data-Free Distillation for Text Classification

Xinyin Ma, Yongliang Shen, Gongfan Fang and
Chen Chen, Chenghao Jia, Weiming Lu

Keywords Paper

nlp tasks, nlp, compressing models, text generation

0

0

0

0

9:36

22/11/2021

Semi-Online Knowledge Distillation

Zhiqiang Liu, Yanxia Liu, Chengkai Huang

Keywords Paper

Knowledge Distillation, Model Compression

0

0

0

0

3:00

14/06/2020

Few Sample Knowledge Distillation for Efficient Network Compression

Tianhong Li, Jianguo Li, Zhuang Liu, Changshui Zhang

Keywords Paper

efficient network compression, few samples, knowledge distillation

0

0

0

0

1:01

30/11/2020

Online Knowledge Distillation via Multi-branch Diversity Enhancement

Zheng Li, Ying Huang, Defang Chen and
Tianren Luo, Ning Cai, Zhigeng Pan

Keywords Paper

0

0

0

0

5:36

03/05/2021

Knowledge distillation via softmax regression representation learning

Jing Yang, Brais Martinez, Adrian Bulat, Georgios Tzimiropoulos

Keywords Paper

0

0

0

0

4:56

02/02/2021

Robust Knowledge Transfer via Hybrid Forward on the Teacher-Student Model

Liangchen Song, Jialian Wu, Ming Yang and
Qian Zhang, Yuan Li, Junsong Yuan

Keywords Paper

0

0

0

0

16:09

02/02/2021

ALP-KD: Attention-Based Layer Projection for Knowledge Distillation

Peyman Passban, Yimeng Wu, Mehdi Rezagholizadeh, Qun Liu

Keywords Paper

0

0

0

0

18:53

14/06/2020

Data-Free Knowledge Amalgamation via Group-Stack Dual-GAN

Jingwen Ye, Yixin Ji, Xinchao Wang and
Xin Gao, Mingli Song

Keywords Paper

knowledge amalgamation, data-free training, gan, adversarial learning, machine learning architectures

0

0

0

0

1:04

08/12/2020

Query Distillation: BERT-based Distillation for Ensemble Ranking

Wangshu Zhang, Junhong Liu, Zujie Wen and
Yafang Wang, Gerard de Melo

Keywords Paper

0

0

0

0

15:01

19/08/2021

Hierarchical Self-supervised Augmented Knowledge Distillation

Chuanguang Yang, Zhulin An, Linhang Cai, Yongjun Xu

Keywords Paper

Computer Vision, Recognition

0

0

0

0

13:30

19/10/2020

Ensembled CTR prediction via knowledge distillation

Jieming Zhu, Jinyang Liu, Weiqi Li and
Jincai Lai, Xiuqiang He, Liang Chen, Zibin Zheng

Keywords Paper

model ensemble, knowledge distillation, ctr prediction, online advertising, recommender systems

0

0

0

0

9:26

04/07/2020

Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language

Qianhui Wu, Zijia Lin, Börje Karlsson and
Jian-Guang Lou, Biqing Huang

Keywords Paper

Single-/Multi-Source NER, named problem, cross-lingual NER, single-source NER

0

0

0

0

10:54

16/11/2020

Amalgamating Knowledge from Two Teachers for Task-oriented Dialogue System with Adversarial Training

Wanwei He, Min Yang, Rui Yan and
Chengming Li, Ying Shen, Ruifeng Xu

Keywords Paper

task completion, generating responses, task-oriented dialogue, task-oriented systems

0

0

0

0

9:15

14/06/2020

Creating Something From Nothing: Unsupervised Knowledge Distillation for Cross-Modal Hashing

Hengtong Hu, Lingxi Xie, Richang Hong, Qi Tian

Keywords Paper

cross-model retrieval, cross-model hashing, unsupervised learning, knowledge distillation, similarity calculation, teacher-student network, supervised learning, unsupervised model guiding supervised model, exploiting information from output features, achieving sota performance

0

0

0

0

1:00

06/12/2021

Learning Student-Friendly Teacher Networks for Knowledge Distillation

Dae Young Park, Moon-Hyun Cha, changwook jeong and
Daesin Kim, Bohyung Han

Keywords Paper

deep learning, transfer learning

0

0

0

0

13:41

06/12/2021

Dynamic Distillation Network for Cross-Domain Few-Shot Recognition with Unlabeled Data

Ashraful Islam, Chun-Fu (Richard) Chen, Rameswar Panda and
Leonid Karlinsky, Rogerio Feris, Richard J. Radke

Keywords Paper

machine learning, meta learning, few shot learning

0

0

0

0

10:10

18/07/2021

A Theory of Label Propagation for Subpopulation Shift

Tianle Cai, Ruiqi Gao, Jason Lee, Qi Lei

Keywords Paper

Theory, Statistical Learning Theory

0

1

0

0

5:08

03/05/2021

Knowledge Distillation as Semiparametric Inference

Tri Dao, Govinda Kamath, Vasilis Syrgkanis, Lester Mackey

Keywords Paper

generalization bounds, knowledge distillation, model compression, loss correction, orthogonal machine learning, cross-fitting, semiparametric inference

0

0

0

0

5:10

13/04/2021

Understanding robustness in teacher-student setting: A new perspective

Zhuolin Yang, Zhaoxi Chen, Tiffany Cai and
Xinyun Chen, Bo Li, Yuandong Tian

Keywords Paper

0

0

0

0

3:03

06/12/2021

Iterative Teacher-Aware Learning

Luyao Yuan, Dongruo Zhou, Junhong Shen and
Jingdong Gao, Jeffrey L Chen, Quanquan Gu, Ying Nian Wu, Song-Chun Zhu

Keywords Paper

theory, optimization, reinforcement learning and planning, machine learning

0

0

0

0

6:40

05/01/2021

Effectiveness of Arbitrary Transfer Sets for Data-Free Knowledge Distillation

Gaurav Kumar Nayak, Konda Reddy Mopuri, Anirban Chakraborty

Keywords Paper

0

0

0

0

5:00

03/05/2021

SEED: Self-supervised Distillation For Visual Representation

Jacob Zhiyuan Fang, Jianfeng Wang, Lijuan Wang and
Lei Zhang, 'YZ' Yezhou Yang, Zicheng Liu

Keywords Paper

Representation Learning, Self Supervised Learning, Knowledge Distillation

0

0

0

0

5:09