How Robust are Model Rankings: A Leaderboard Customization Approach for Equitable Evaluation

Abstract: Models that top leaderboards often perform unsatisfactorily when deployed in real-world applications; this has necessitated rigorous and expensive pre-deployment model testing. A hitherto unexplored facet of model performance is: are our leaderboards doing equitable evaluation? In this paper, we introduce a task-agnostic method to probe leaderboards by weighting samples based on their 'difficulty' level. We find that leaderboards can be adversarially attacked and that top-performing models may not always be the best models. We subsequently propose alternate evaluation metrics. Our experiments on 10 models show changes in model ranking and an overall reduction in previously reported performance, thus rectifying the overestimation of AI systems' capabilities. Inspired by behavioral testing principles, we further develop a prototype of a visual analytics tool that enables leaderboard revamping through customization, based on an end user's focus area. This helps users analyze models' strengths and weaknesses, and guides them in selecting the model best suited to their application scenario. In a user study, members of various commercial product development teams, covering 5 focus areas, find that our prototype reduces pre-deployment development and testing effort by 41% on average.

02/02/2021

auto ml, hyperparameter optimization, meta learning, task aware, hyperband, hyperparameters, warm start, image classification, resnet, shufflenet

4:58

03/05/2021

What Matters for On-Policy Deep Actor-Critic Methods? A Large-Scale Study

Marcin Andrychowicz, Anton Raichuk, Piotr Stanczyk, Manu Orsini, Sertan Girgin, Raphaël Marinier, Hussenot Hussenot-Desenonges, Matthieu Geist, Olivier Pietquin, Marcin Michalski, Sylvain Gelly, Olivier Bachem

Natural Language Processing, Resources and Evaluation, NLP Applications and Tools, Text Classification, Question Answering

14:26

06/12/2020

Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Nazanin Mohammadi Sepahvand, Edward Raff, Kanika Madan, Vikram Voleti, Samira Ebrahimi Kahou, Vincent Michalski, Tal Arbel, Chris Pal, Gael Varoquaux, Pascal Vincent

19:40

05/04/2021

Accounting for Variance in Machine Learning Benchmarks

Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Nazanin Mohammadi Sepahvand, Edward Raff, Kanika Madan, Vikram Voleti, Samira Ebrahimi Kahou, Vincent Michalski, Tal Arbel, Chris Pal, Gael Varoquaux, Pascal Vincent

Sebastien Levy, Randolph Yao, Youjiang Wu, Yingnong Dang, Peng Huang, Zheng Mu, Pu Zhao, Tarun Ramani, Naga Govindaraju, Xukun Li, Qingwei Lin, Gil Lapid Shafriri, Murali Chintalapati

18:04

05/04/2021

SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection

Yue Zhao, Xiyang Hu, Cheng Cheng, Cong Wang, Changlin Wan, Wen Wang, Jianing Yang, Haoping Bai, Zheng Li, Cao Xiao, Yunlong Wang, Zhi Qiao, Jimeng Sun, Leman Akoglu

Algorithms -> Adversarial Learning; Algorithms -> Image Segmentation; Algorithms -> Semi-Supervised Learning; Applications -> Computer Vision; Applications -> Imag

18:47

05/04/2021

SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection

Yue Zhao, Xiyang Hu, Cheng Cheng, Cong Wang, Changlin Wan, Wen Wang, Jianing Yang, Haoping Bai, Zheng Li, Cao Xiao, Yunlong Wang, Zhi Qiao, Jimeng Sun, Leman Akoglu

Algorithms -> Adversarial Learning; Algorithms -> Image Segmentation; Algorithms -> Semi-Supervised Learning; Applications -> Computer Vision; Applications -> Imag

4:53

16/11/2020