19/10/2020

TwinBERT: Distilling knowledge to twin-structured compressed BERT models for large-scale retrieval

Wenhao Lu, Jian Jiao, Ruofei Zhang

Keywords: knowledge distillation, semantic embedding, sponsored search, bert, information retrieval, deep neural network, deep learning

Abstract: Pre-trained language models have achieved great success in a wide variety of natural language processing (NLP) tasks, but their superior performance comes with a high demand for computational resources, which hinders their application in low-latency information retrieval (IR) systems. To address this problem, we present the TwinBERT model, which introduces two improvements: 1) the query and document are represented separately by twin-structured encoders, and 2) each encoder is a highly compressed BERT-like model with less than one third of the parameters. The former allows document embeddings to be pre-computed offline and cached in memory, unlike BERT, where the two input sentences are concatenated and encoded together. This change saves a large amount of computation time; however, it is still not sufficient for real-time retrieval given the complexity of the BERT model itself. To further reduce computational cost, a compressed multi-layer transformer encoder is proposed, with dedicated training strategies, as a substitute for the original complex BERT encoder. Lastly, two versions of TwinBERT are developed to combine the query and keyword embeddings for the retrieval and relevance tasks, respectively. Both meet the real-time latency requirement and achieve performance close to or on par with the BERT-Base model. The models were trained following the teacher-student framework and evaluated with data from one of the major search engines. Experimental results showed that inference time was significantly reduced and, for the first time, controlled within 20 ms on CPUs, while the performance gain from the fine-tuned BERT-Base model was mostly retained. Integration of the models into production systems also demonstrated remarkable improvements in relevance metrics with negligible impact on latency. The models were released in 2019 with significant production impact.
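To make the twin-structured idea in the abstract concrete, here is a minimal PyTorch sketch of a compressed BERT-like encoder pair whose query and document embeddings are combined by a cosine crossing layer. The class names, layer sizes, pooling choice, and crossing function are illustrative assumptions for this sketch, not the paper's exact architecture or configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompressedEncoder(nn.Module):
    """A small BERT-like transformer encoder (sizes are illustrative, not the paper's)."""
    def __init__(self, vocab_size=30522, hidden=312, layers=4, heads=4, max_len=64):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.pos = nn.Embedding(max_len, hidden)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=4 * hidden, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, ids, mask):
        # Token + position embeddings, then the transformer stack.
        x = self.tok(ids) + self.pos(torch.arange(ids.size(1), device=ids.device))
        h = self.encoder(x, src_key_padding_mask=~mask.bool())
        # Mean-pool over non-padding tokens into one sentence-level embedding.
        m = mask.unsqueeze(-1).float()
        return (h * m).sum(1) / m.sum(1).clamp(min=1e-6)

class TwinEncoderCos(nn.Module):
    """Twin-structured model: query and document are encoded independently,
    then combined by a cosine crossing layer (hypothetical retrieval-style variant)."""
    def __init__(self):
        super().__init__()
        self.query_encoder = CompressedEncoder()
        self.doc_encoder = CompressedEncoder()

    def encode_doc(self, ids, mask):
        # Document embeddings can be pre-computed offline and cached in memory.
        return self.doc_encoder(ids, mask)

    def forward(self, q_ids, q_mask, cached_doc_emb):
        # At serving time only the (short) query is encoded; cached document
        # embeddings are looked up and scored with cosine similarity.
        q = self.query_encoder(q_ids, q_mask)
        return F.cosine_similarity(q, cached_doc_emb, dim=-1)
```

Under this sketch, serving cost is dominated by a single pass of the compressed query encoder plus a vector similarity lookup, which is what allows the decoupled design to meet a tight CPU latency budget; training the compressed encoders against a fine-tuned BERT-Base teacher (the teacher-student framework mentioned in the abstract) would be layered on top of this forward pass.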

The video of this talk cannot be embedded. You can watch it here:
https://dl.acm.org/doi/10.1145/3340531.3412747#sec-supp
The talk and the paper are published at the CIKM 2020 virtual conference.
