Abstract:
Pre-trained language models have achieved great success in a wide variety of natural language processing (NLP) tasks, but their superior performance comes at the cost of high computational demand, which hinders their application in low-latency information retrieval (IR) systems. To address this problem, we present TwinBERT, which introduces two improvements: 1) query and document are represented separately by twin-structured encoders, and 2) each encoder is a highly compressed BERT-like model with less than one third of the parameters. The former allows document embeddings to be pre-computed offline and cached in memory, unlike BERT, where the two input sentences are concatenated and encoded together. This change saves a large amount of computation time; however, it is still not sufficient for real-time retrieval given the complexity of the BERT model itself. To further reduce computational cost, a compressed multi-layer transformer encoder, trained with special strategies, is proposed as a substitute for the original complex BERT encoder. Lastly, two versions of TwinBERT are developed to combine the query and keyword embeddings for retrieval and relevance tasks, respectively. Both meet the real-time latency requirement and achieve close or on-par performance to the BERT-Base model. The models were trained following the teacher-student framework and evaluated with data from one of the major search engines. Experimental results showed that the inference time was significantly reduced and, for the first time, kept within 20ms on CPUs, while most of the performance gain from the fine-tuned BERT-Base model was retained. Integration of the models into production systems also demonstrated remarkable improvements on relevance metrics with negligible impact on latency. The models were released in 2019 with significant production impact.
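To make the twin-structured serving flow described above concrete, the following is a minimal illustrative sketch, not the actual TwinBERT implementation: all class and function names (TinyEncoder, retrieve) and the cosine-similarity combination are assumptions introduced only to show how cached document embeddings can be scored against a query embedding computed online.

```python
# Hypothetical sketch of the twin-encoder serving idea: document embeddings
# are pre-computed offline and cached, so only the query is encoded at
# serving time. This is NOT the released TwinBERT code.
import numpy as np


class TinyEncoder:
    """Stand-in for a compressed BERT-like encoder mapping text to a vector."""

    def __init__(self, dim: int = 128):
        self.dim = dim

    def encode(self, text: str) -> np.ndarray:
        # Placeholder: deterministic pseudo-embedding keyed on the text;
        # a real encoder would run a multi-layer transformer here.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        vec = rng.standard_normal(self.dim)
        return vec / np.linalg.norm(vec)


# Offline step: encode every document once and cache the embeddings in memory.
doc_encoder = TinyEncoder()
documents = ["cheap flights to seattle", "seattle hotel deals", "python tutorial"]
doc_cache = {doc: doc_encoder.encode(doc) for doc in documents}

# Online step: encode only the query, then score it against the cached vectors.
query_encoder = TinyEncoder()


def retrieve(query: str, top_k: int = 2):
    q = query_encoder.encode(query)
    # Cosine similarity (vectors are unit-normalized) as one simple way to
    # combine the twin embeddings for the retrieval task.
    scored = sorted(doc_cache.items(), key=lambda kv: -float(q @ kv[1]))
    return [doc for doc, _ in scored[:top_k]]


print(retrieve("flights to seattle"))
```

Because the document side of the computation is moved offline, the online cost reduces to one pass of the compressed query encoder plus a cheap embedding combination, which is what makes the sub-20ms CPU latency reported above attainable.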