Beyond Accuracy: Behavioral Testing of NLP Models with CheckList (Extended Abstract)

Abstract: Although measuring held-out accuracy has been the primary approach to evaluating generalization, it often overestimates the performance of NLP models, while alternative approaches to evaluating models focus either on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitates comprehensive test ideation, as well as a software tool to quickly generate a large number of diverse test cases. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-the-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests and found almost three times as many bugs as users without it.
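As a rough illustration of the idea, the Python sketch below fills two cells of a capability-by-test-type matrix for a sentiment task: a negation check with a fixed expected label, and a name-swap invariance check. It does not use the CheckList tool itself. The names `toy_sentiment`, `expand`, `failure_rate`, and `invariance_failure_rate` are hypothetical and invented for this example, and the keyword model is a deliberately weak stand-in for a real classifier.

```python
from itertools import product


def toy_sentiment(text):
    """Hypothetical stand-in model: counts sentiment keywords, ignores negation."""
    positive = {"good", "great", "excellent"}
    negative = {"bad", "terrible", "awful"}
    words = text.lower().replace(".", "").split()
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def expand(template, **slots):
    """Fill a template with every combination of the given slot values."""
    keys = list(slots)
    combos = product(*(slots[k] for k in keys))
    return [template.format(**dict(zip(keys, combo))) for combo in combos]


def failure_rate(cases, expected, model):
    """Fraction of cases whose prediction differs from a fixed expected label."""
    return sum(model(c) != expected for c in cases) / len(cases)


def invariance_failure_rate(pairs, model):
    """Fraction of pairs whose two predictions disagree with each other."""
    return sum(model(a) != model(b) for a, b in pairs) / len(pairs)


# Capability: negation. Test type: fixed expected label (assumed "negative").
negation_cases = expand(
    "The {thing} was not {adj}.",
    thing=["food", "service"],
    adj=["good", "great", "excellent"],
)
negation_rate = failure_rate(negation_cases, "negative", toy_sentiment)

# Capability: robustness to names. Test type: invariance under a name swap.
name_pairs = [
    ("Anna said the food was great.", "Maria said the food was great."),
    ("John said the service was bad.", "Luis said the service was bad."),
]
invariance_rate = invariance_failure_rate(name_pairs, toy_sentiment)

print(f"negation failure rate: {negation_rate:.0%}")
print(f"name invariance failure rate: {invariance_rate:.0%}")
```

The toy model passes the invariance check but fails every templated negation case, the kind of gap that a single held-out accuracy number can hide.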

12/08/2020

13/04/2021

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh

Similar Papers

AURORA: Statistical Crash Analysis for Automated Root Cause Explanation

Tim Blazytko, Moritz Schlögel, Cornelius Aschermann, Ali Abbasi, Joel Frank, Simon Wörner, Thorsten Holz

A Large-Scale Longitudinal Study of Flaky Tests

Wing Lam, Stefan Winter, Anjiang Wei, Tao Xie, Darko Marinov, Jonathan Bell

Keywords: flaky test, regression testing

Fuzzing Error Handling Code using Context-Sensitive Software Fault Injection

Zu-Ming Jiang, Jia-Ju Bai, Kangjie Lu, Shi-Min Hu

Foundations of Empirical Memory Consistency Testing

Jake Kirkham, Tyler Sorensen, Esin Tureci, Margaret Martonosi

Keywords: autotuning, conformance testing, memory consistency, GPUs, OpenCL

RTPTorrent: An open-source dataset for evaluating regression test prioritization

Toni Mattis, Patrick Rein, Falco Dürsch, Robert Hirschfeld

Keywords: Regression Test Prioritization, Dataset, Java, GitHub, TravisCI

PClean: Bayesian data cleaning at scale with domain-specific probabilistic programming

Alexander Lew, Monica Agrawal, David Sontag, Vikash Mansinghka

Testing Database Engines via Pivoted Query Synthesis

Manuel Rigger, Zhendong Su

Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning

Hanlu Wu, Tengfei Ma, Lingfei Wu, Tariro Manyumwa, Shouling Ji

Keywords: summarization task, document system, rouge, unsupervised learning

Revisiting adversarially learned injection attacks against recommender systems

Jiaxi Tang, Hongyi Wen, Ke Wang

Keywords: Recommender System, Security and Privacy, Adversarial Machine Learning

Towards Causal VQA: Revealing and Reducing Spurious Correlations by Invariant and Covariant Semantic Editing

Vedika Agarwal, Rakshith Shetty, Mario Fritz

Keywords: robustness, vqa, causality, gan, dataset, evaluation, automated semantic scene editing, data augmentation, invariance, covariance

Good classifiers are abundant in the interpolating regime

Ryan Theisen, Jason Klusowski, Michael Mahoney

On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law

Damien Teney, Ehsan Abbasnejad, Kushal Kafle, Robik Shrestha, Christopher Kanan, Anton van den Hengel

Hidden biases in unreliable news detection datasets

Xiang Zhou, Heba Elfardy, Christos Christodoulopoulos, Thomas Butler, Mohit Bansal

Unsupervised Opinion Summarization with Noising and Denoising

Reinald Kim Amplayo, Mirella Lapata

Keywords: Unsupervised Summarization, supervised models, abstractive summarization, Noising

Few-Shot Learning for Opinion Summarization

Arthur Bražinskas, Mirella Lapata, Ivan Titov

Keywords: opinion summarization, automatic text, summary production, summarization mode

Fast, Sound, and Effectively Complete Dynamic Race Prediction

Andreas Pavlogiannis

Keywords: race detection, predictive analyses, concurrency

Precise Inference of Expressive Units of Measurement Types

Tongtong Xiang, Jeff Y. Luo, Werner Dietl

Keywords: Scientific computing, Pluggable type system, Dimensional analysis, Units of measurements, Type inference

Feature extraction for large-scale text collections

Luke Gallagher, Antonio Mallia, J. Shane Culpepper, Torsten Suel, B. Barla Cambazoglu

Keywords: clueweb, feature index, feature extraction, feature repository, lambdamart, ltr, learning to rank, feature importance

Execution Reconstruction: Harnessing Failure Reoccurrences for Failure Reproduction

Gefei Zuo, Jiacheng Ma, Andrew Quinn, Pramod Bhatotia, Pedro Fonseca, Baris Kasikci

Keywords: debugging, symbolic execution

Bayesian Multi-type Mean Field Multi-agent Imitation Learning

Fan Yang, Alina Vereshchaka, Changyou Chen, Wen Dong

With Little Power Comes Great Responsibility

Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, Dan Jurafsky

Keywords: human studies, machine translation, power analysis, power analyses
