Abstract:
Recent research has achieved impressive results on understanding and improving source code by building on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings such as BERT. These can be fine-tuned for downstream tasks with less labeled data and a smaller training budget, while achieving better accuracy. However, there has been no attempt yet to obtain a high-quality contextual embedding of source code and to evaluate it on multiple tasks simultaneously.
In this paper, we address this gap by curating a code-understanding benchmark and evaluating a learned contextual embedding of source code on it. More specifically, we curate a massive, deduplicated corpus of Python code from GitHub and use it to train a BERT model, which we call B4C. We also create a benchmark comprising five classification tasks and one program-repair task, akin to code-understanding tasks previously proposed in the literature. For comparison, we train different variants of Word2Vec token embeddings, as well as BiLSTM and Transformer models; for the repair task, we also compare against state-of-the-art models. We show that fine-tuned B4C models give better results, even with shorter training or fewer labeled examples. Future work on source-code embeddings could benefit from reusing our benchmark and comparing against B4C as a strong baseline.