DOBF: A Deobfuscation Pre-Training Objective for Programming Languages

06/12/2021

DOBF: A Deobfuscation Pre-Training Objective for Programming Languages

Marie-Anne Lachaux, Baptiste Roziere, Marc Szafraniec, Guillaume Lample

Keywords: self-supervised learning

Abstract Paper Similar Papers

Abstract: Recent advances in self-supervised learning have dramatically improved the state of the art on a wide variety of tasks. However, research in language model pre-training has mostly focused on natural languages, and it is unclear whether models like BERT and its variants provide the best pre-training when applied to other modalities, such as source code. In this paper, we introduce a new pre-training objective, DOBF, that leverages the structural aspect of programming languages and pre-trains a model to recover the original version of obfuscated source code. We show that models pre-trained with DOBF significantly outperform existing approaches on multiple downstream tasks, providing relative improvements of up to 12.2% in unsupervised code translation, and 5.3% in natural language code search. Incidentally, we found that our pre-trained model is able to deobfuscate fully obfuscated source files, and to suggest descriptive variable names.

0

0

0

0

Share

This is an embedded video. Talk and the respective paper are published at NeurIPS 2021 virtual conference. If you are one of the authors of the paper and want to manage your upload, see the question "My papertalk has been externally embedded..." in the FAQ section.

Comments

Post Comment

no comments yet

Similar Papers

19/04/2021

Cross-lingual visual pre-training for multimodal machine translation

Ozan Caglayan, Menekse Kuyu, Mustafa Sercan Amac and
Pranava Madhyastha, Erkut Erdem, Aykut Erdem, Lucia Specia

Keywords Paper

0

0

0

0

6:16

26/04/2020

Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model

Wenhan Xiong, Jingfei Du, William Yang Wang, Veselin Stoyanov

Keywords Paper

0

0

0

0

5:00

12/07/2020

Learning and Evaluating Contextual Embedding of Source Code

Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, Kensen Shi

Keywords Paper

Representation Learning

0

0

0

0

12:51

16/11/2020

Meta Fine-Tuning Neural Language Models for Multi-Domain Text Mining

Chengyu Wang, Minghui Qiu, Jun Huang, Xiaofeng He

Keywords Paper

nlp tasks, fine-tuning, learning process, multi-domain tasks

0

0

0

0

9:58

19/08/2021

Automatically Paraphrasing via Sentence Reconstruction and Round-trip Translation

Zilu Guo, Zhongqiang Huang, Kenny Q. Zhu and
Guandan Chen, Kaibo Zhang, Boxing Chen, Fei Huang

Keywords Paper

Natural Language Processing, Machine Translation, Natural Language Generation, NLP Applications and Tools

0

0

0

0

13:53

16/11/2020

PatchBERT: Just-in-Time, Out-of-Vocabulary Patching

Sangwhan Moon, Naoaki Okazaki

Keywords Paper

natural processing, downstream tasks, mitigation, large models

0

0

0

0

7:02

04/07/2020

Towards Unsupervised Language Understanding and Generation by Joint Dual Learning

Shang-Yu Su, Chao-Wei Huang, Yun-Nung Chen

Keywords Paper

Unsupervised Understanding, Unsupervised Generation, natural understanding, natural generation

0

0

0

0

8:23

16/11/2020

SLM: Learning a Discourse Language Representation with Sentence Unshuffling

Haejun Lee, Drew A. Hudson, Kangwook Lee, Christopher D. Manning

Keywords Paper

nlp, sentence-level modeling, discourse representation, pre-training methods

0

0

0

0

9:21

19/04/2021

Keep learning: Self-supervised meta-learning for learning from inference

Akhil Kedia, Sai Chetan Chinthakindi

Keywords Paper

0

0

0

0

11:27

16/11/2020

Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting

Sanyuan Chen, Yutai Hou, Yiming Cui and
Wanxiang Che, Ting Liu, Xiangzhan Yu

Keywords Paper

pretraining, pretraining tasks, learning tasks, fine-tuning bert-large

0

0

0

1

10:52

04/07/2020

Span Selection Pre-training for Question Answering

Michael Glass, Alfio Gliozzo, Rishav Chakravarti and
Anthony Ferritto, Lin Pan, G P Shrivatsa Bhargav, Dinesh Garg, Avi Sil

Keywords Paper

Question Answering, language tasks, Next Prediction, pre-training task

0

0

0

0

13:16

16/11/2020

Autoregressive Knowledge Distillation through Imitation Learning

Alexander Lin, Jeremy Wohlwend, Howard Chen, Tao Lei

Keywords Paper

natural tasks, knowledge distillation, exposure problem, prototypical tasks

0

0

0

0

12:43

16/11/2020

Self-Supervised Meta-Learning for Few-Shot Natural Language Classification Tasks

Trapit Bansal, Rishikesh Jha, Tsendsuren Munkhdalai, Andrew McCallum

Keywords Paper

nlp applications, fine-tuning, meta-learning problem, supervised tasks

0

0

0

0

11:49

18/07/2021

Self-supervised and Supervised Joint Training for Resource-rich Machine Translation

Yong Cheng, Wei Wang, Lu Jiang, Wolfgang Macherey

Keywords Paper

Applications, Natural Language Processing

0

0

0

0

5:21

03/05/2021

Deberta: Decoding-Enhanced Bert With Disentangled Attention

Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen

Keywords Paper

Position Encoding, Attention, Natural Language Processing, Language Model Pre-training, Transformer

0

0

0

0

6:06

05/12/2020

Heads-up! Unsupervised constituency parsing via self-attention heads

Bowen Li, Taeuk Kim, Reinald Kim Amplayo, Frank Keller

Keywords Paper

0

0

0

0

13:55

16/11/2020

Cross-Thought for Sentence Encoder Pre-training

Shuohang Wang, Yuwei Fang, Siqi Sun and
Zhe Gan, Yu Cheng, Jingjing Liu, Jing Jiang

Keywords Paper

pre-training encoder, large-scale tasks, question answering, predicting words

0

0

0

0

12:06

04/07/2020

Leveraging Pre-trained Checkpoints for Sequence Generation Tasks

Sascha Rothe, Shashi Narayan and Aliaksei Severyn

Keywords Paper

Sequence Tasks, Natural Processing, Natural tasks, Sequence Generation

0

0

0

0

13:12

04/07/2020

Roles and Utilization of Attention Heads in Transformer-based Neural Language Models

Jae-young Jo, Sung-Hyon Myaeng

Keywords Paper

Transformer-based Models, natural tasks, downstream tasks, probing tasks

0

0

0

0

12:17

16/11/2020

PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation

Bin Bi, Chenliang Li, Chen Wu and
Ming Yan, Wei Wang, Songfang Huang, Fei Huang, Luo Si

Keywords Paper

natural generation, language tasks, generative answering, conversational generation

0

0

0

0

11:02

26/04/2020

StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding

Wei Wang, Bin Bi, Ming Yan and
Chen Wu, Jiangnan Xia, Zuyi Bao, Liwei Peng, Luo Si

Keywords Paper

0

0

0

0

5:34

16/11/2020

POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training

Yizhe Zhang, Guoyin Wang, Chunyuan Li and
Zhe Gan, Chris Brockett, Bill Dolan

Keywords Paper

language learning, free-form generation, hard-constrained generation, hard-constrained tasks

0

0

0

0

10:09

16/11/2020

Improving AMR Parsing with Sequence-to-Sequence Pre-training

Dongqin Xu, Junhui Li, Muhua Zhu and
Min Zhang, Guodong Zhou

Keywords Paper

abstract parsing, amr parsing, sequence-to-sequence parsing, machine translation

0

0

0

0

11:42

06/12/2020

Learning Sparse Prototypes for Text Generation

Junxian He, Taylor Berg-Kirkpatrick, Graham Neubig

Keywords Paper

0

0

0

0

3:22

16/11/2020

Dynamic Context Selection for Document-level Neural Machine Translation via Reinforcement Learning

Xiaomian Kang, Yang Zhao, Jiajun Zhang, Chengqing Zong

Keywords Paper

document-level translation, translations, document-level model, selection module

0

0

0

0

11:36

04/07/2020

MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification

Jiaao Chen, Zichao Yang, Diyi Yang

Keywords Paper

Semi-Supervised Classification, text classification, data augmentation, supervision

0

0

0

0

10:54

04/07/2020

Curriculum Learning for Natural Language Understanding

Benfeng Xu, Licheng Zhang, Zhendong Mao and
Quan Wang, Hongtao Xie, Yongdong Zhang

Keywords Paper

Curriculum Learning, Natural Understanding, natural tasks, NLU tasks

0

0

0

0

9:41

16/11/2020

The Thieves on Sesame Street are Polyglots - Extracting Multilingual Models from Monolingual APIs

Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, Richard Socher

Keywords Paper

pre-training, natural processing, victim model, extraction process

0

0

0

0

7:01

04/07/2020

Language to Network: Conditional Parameter Adaptation with Natural Language Descriptions

Tian Jin, Zhun Liu, Shengjia Yan and
Alexandre Eichenberger, Louis-Philippe Morency

Keywords Paper

Transfer learning, computer tasks, fine-tuning, Conditional Adaptation

0

0

0

0

5:42

02/02/2021

Towards Semantics-Enhanced Pre-Training: Can Lexicon Definitions Help Learning Sentence Meanings?

Xuancheng Ren, Xu Sun, Houfeng Wang, Qun Liu

Keywords Paper

0

0

0

0

16:04

04/07/2020

IMoJIE: Iterative Memory-Based Joint Open Information Extraction

Keshav Kolluru, Samarth Aggarwal, Vipul Rathore and
Mausam -, Soumen Chakrabarti

Keywords Paper

Iterative Extraction, Open Extraction, IMoJIE, Iterative

0

0

0

0

9:31

05/12/2020

Investigating learning dynamics of BERT fine-tuning

Yaru Hao, Li Dong, Furu Wei, Ke Xu

Keywords Paper

0

0

0

0

7:10

14/06/2020

ScrabbleGAN: Semi-Supervised Varying Length Handwritten Text Generation

Sharon Fogel, Hadar Averbuch-Elor, Sarel Cohen and
Shai Mazor, Roee Litman

Keywords Paper

gan, semi-supervised, domain-adaptation, handwriting, generative, unlabeled, transfer learning, ocr, text, augmentation

0

0

0

0

1:01

04/07/2020

Cross-Lingual Semantic Role Labeling with High-Quality Translated Training Corpus

Hao Fei, Meishan Zhang, Donghong Ji

Keywords Paper

Cross-Lingual Labeling, semantic labeling, natural understanding, model transferring

0

0

0

0

10:32

08/12/2020

Designing Templates for Eliciting Commonsense Knowledge from Pretrained Sequence-to-Sequence Models

Jheng-Hong Yang, Sheng-Chieh Lin, Rodrigo Nogueira and
Ming-Feng Tsai, Chuan-Ju Wang, Jimmy Lin

Keywords Paper

0

0

0

0

9:14

04/07/2020

Encoder-Decoder Models Can Benefit from Pre-trained Masked Language Models in Grammatical Error Correction

Masahiro Kaneko, Masato Mita, Shun Kiyono and
Jun Suzuki, Kentaro Inui

Keywords Paper

Grammatical Correction, GEC, Encoder-Decoder Models, Pre-trained Models

0

0

0

0

6:44

16/11/2020

Local Additivity Based Data Augmentation for Semi-supervised NER

Jiaao Chen, Zhenghui Wang, Ran Tian and
Zichao Yang, Diyi Yang

Keywords Paper

named recognition, deep understanding, semi-supervised ner, entity learning

0

0

0

0

11:18

06/12/2020

Towards Interpretable Natural Language Understanding with Explanations as Latent Variables

Wangchunshu Zhou, Jinyi Hu, Hanlin Zhang and
Xiaodan Liang, Maosong Sun, Chenyan Xiong, Jian Tang

Keywords Paper

0

0

0

0

3:22

16/11/2020

Self-Paced Learning for Neural Machine Translation

Yu Wan, Baosong Yang, Derek F. Wong and
Yikai Zhou, Lidia S. Chao, Haibo Zhang, Boxing Chen

Keywords Paper

neural, curriculum learning, translation tasks, nmt

0

0

0

0

6:03

16/11/2020

Template Guided Text Generation for Task-Oriented Dialogue

Mihir Kale, Abhinav Rastogi

Keywords Paper

natural generation, generation, nlg, domain-independent model

0

0

0

0

12:25