Improving Dialog Evaluation with a Multi-reference Adversarial Dataset and Large Scale Pretraining

Abstract: There is an increasing focus on model-based dialog evaluation metrics such as ADEM, RUBER, and the more recent BERT-based metrics. These models aim to assign a high score to all relevant responses and a low score to all irrelevant responses. Ideally, such models should be trained using multiple relevant and irrelevant responses for any given context. However, no such data is publicly available, and hence existing models are usually trained using a single relevant response and multiple randomly selected responses from other contexts (random negatives). To allow for better training and robust evaluation of model-based metrics, we introduce the DailyDialog++ dataset, consisting of (i) five relevant responses for each context and (ii) five adversarially crafted irrelevant responses for each context. Using this dataset, we first show that even in the presence of multiple correct references, n-gram-based metrics and embedding-based metrics do not perform well at separating relevant responses from even random negatives. While model-based metrics perform better than n-gram-based and embedding-based metrics on random negatives, their performance drops substantially when evaluated on adversarial examples. To check if large-scale pretraining could help, we propose a new BERT-based evaluation metric called DEB, which is pretrained on 727M Reddit conversations and then finetuned on our dataset. DEB significantly outperforms existing models, showing better correlation with human judgements and better performance on random negatives (88.27% accuracy). However, its performance again drops substantially when evaluated on adversarial responses, thereby highlighting that even large-scale pretrained evaluation models are not robust to the adversarial examples in our dataset. The dataset and code are publicly available.
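
The accuracy figures in the abstract describe how well a metric separates relevant responses from irrelevant ones. A minimal sketch of that kind of measurement is given below, under stated assumptions: the function name `classification_accuracy`, the 0.5 decision threshold, and all of the scores are hypothetical and chosen only for illustration. They are not taken from the paper, from DEB, or from DailyDialog++.

```python
from typing import Sequence


def classification_accuracy(
    relevant_scores: Sequence[float],
    irrelevant_scores: Sequence[float],
    threshold: float = 0.5,  # assumed cut-off, not from the paper
) -> float:
    """Fraction of responses placed on the correct side of the threshold.

    A relevant response counts as correct when its score is at or above
    the threshold; an irrelevant response counts as correct when its
    score is below it.
    """
    total = len(relevant_scores) + len(irrelevant_scores)
    if total == 0:
        raise ValueError("at least one score is required")
    correct = sum(s >= threshold for s in relevant_scores)
    correct += sum(s < threshold for s in irrelevant_scores)
    return correct / total


# Toy scores for one context (made up for illustration only).
relevant = [0.91, 0.78, 0.64, 0.55, 0.42]
random_negatives = [0.05, 0.12, 0.30, 0.08, 0.20]
adversarial_negatives = [0.35, 0.62, 0.71, 0.48, 0.58]

acc_random = classification_accuracy(relevant, random_negatives)
acc_adversarial = classification_accuracy(relevant, adversarial_negatives)
print(f"random negatives:      {acc_random:.2f}")
print(f"adversarial negatives: {acc_adversarial:.2f}")
```

With these toy numbers the same scorer looks much weaker on the adversarial set than on the random set, which is the qualitative pattern the abstract reports for model-based metrics.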

02/02/2021

Algorithms -> Multitask and Transfer Learning; Algorithms -> Representation Learning; Data, Challenges, Implementations, and So, Applications -> Natural Language Processing

3:15

12/07/2020

Ananya Sai, Akash Mohan Kumar, Siddhartha Arora, Mitesh Khapra

Similar Papers

Re-TACRED: Addressing Shortcomings of the TACRED Dataset

George Stoica, Emmanouil Antonios Platanios, Barnabas Poczos

Counterfactual Samples Synthesizing for Robust Visual Question Answering

Long Chen, Xin Yan, Jun Xiao, Hanwang Zhang, Shiliang Pu, Yueting Zhuang

visual question answering, counterfactual, debias, language bias, data augmentation, visual-and-language

Just Train Twice: Improving Group Robustness without Training Group Information

Evan Liu, Behzad Haghgoo, Annie Chen, Aditi Raghunathan, Pang Wei Koh, Shiori Sagawa, Percy Liang, Chelsea Finn

Deep Learning

Robustness to Spurious Correlations in Text Classification via Automatically Generated Counterfactuals

Zhao Wang, Aron Culotta

The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations

Peter Hase, Harry Xie, Mohit Bansal

machine learning, interpretability

Interpreting Pretrained Contextualized Representations via Reductions to Static Embeddings

Rishi Bommasani, Kelly Davis, Claire Cardie

Interpreting Representations, downstream applications, static embeddings, Pretrained Representations

Suppressing Uncertainties for Large-Scale Facial Expression Recognition

Kai Wang, Xiaojiang Peng, Jianfei Yang, Shijian Lu, Yu Qiao

emotion recognition, self-cure network, uncertainties

Selective Question Answering under Domain Shift

Amita Kamath, Robin Jia, Percy Liang

Selective Answering, Domain Shift, question models, QA models

Seeing without Looking: Contextual Rescoring of Object Detections for AP Maximization

Lourenço V. Pato, Renato Negrinho, Pedro M. Q. Aguiar

object detection, context, rescoring, average precision, non-maximum suppression

With Little Power Comes Great Responsibility

Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, Dan Jurafsky

human studies, machine translation, power analysis, power analyses

Fairness via Representation Neutralization

Mengnan Du, Subhabrata Mukherjee, Guanchu Wang, Ruixiang Tang, Ahmed Awadallah, Xia Hu

machine learning, fairness, interpretability

Posterior Calibrated Training on Sentence Classification Tasks

Taehee Jung, Dongyeop Kang, Hua Cheng, Lucas Mentch, Thomas Schaaf

Sentence Tasks, classifications, xSLUE, classification tasks

On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines

Marius Mosbach, Maksym Andriushchenko, Dietrich Klakow

BERT, transfer learning, pretrained language model, fine-tuning stability

Adversarial Distributional Training for Robust Deep Learning

Yinpeng Dong, Zhijie Deng, Tianyu Pang, Jun Zhu, Hang Su

Evaluating Robustness of Predictive Uncertainty Estimation: Are Dirichlet-based Models Reliable?

Anna-Kathrin Kopetzki, Bertrand Charpentier, Daniel Zügner, Sandhya Giri, Stephan Günnemann

Algorithms, Supervised Learning

Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension

Max Bartolo, Alastair Roberts, Johannes Welbl, Sebastian Riedel, Pontus Stenetorp

annotation methodology, annotation process, training, rc models

Robust Pre-Training by Adversarial Contrastive Learning

Ziyu Jiang, Tianlong Chen, Ting Chen, Zhangyang Wang

Learning Debiased Representation via Disentangled Feature Augmentation

Jungsoo Lee, Eungyeup Kim, Juyoung Lee, Jihyeon Lee, Jaegul Choo

machine learning, vision

On Losses for Modern Language Models

Stéphane Aroca-Ouellette, Frank Rudzicz

pre-training, masked modelling, next prediction, nsp

ColdGANs: Taming Language GANs with Cautious Sampling Strategies

Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano

Multi-task Batch Reinforcement Learning with Metric Learning

Jiachen Li, Quan Vuong, Shuang Liu, Minghua Liu, Kamil Ciosek, Henrik Christensen, Hao Su
