Abstract:
Recent research has achieved impressive results on understanding and improving source code by building on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings such as BERT. These can be fine-tuned for downstream tasks with less labeled data and a smaller training budget, while achieving better accuracy. However, there has been no attempt yet to obtain a high-quality contextual embedding of source code and to evaluate it on multiple tasks simultaneously.
In this paper, we address this gap by curating a code-understanding benchmark and evaluating a learned contextual embedding of source code on it. More specifically, we curate a massive, deduplicated corpus of Python code from GitHub and use it to train a BERT model, which we call B4C. We also create a benchmark comprising five classification tasks and one program-repair task, akin to code-understanding tasks previously proposed in the literature. For comparison, we train different variants of Word2Vec token embeddings, as well as BiLSTM and Transformer models; for the repair task, we also compare against state-of-the-art models. We show that fine-tuned B4C models give better results, even with shorter training or fewer labeled examples. Future work on source-code embeddings could benefit from reusing our benchmark and comparing against B4C as a strong baseline.