Abstract:
Building machine learning datasets often requires automated data retrieval and crowdsourced annotation, giving rise to an inevitably noisy pipeline. We perform large-scale human studies to investigate the impact of such a pipeline on ImageNet---one of the key datasets driving progress in computer vision. We find that seemingly innocuous design choices (e.g., the exact task setup, filtering procedure, and annotators employed) can have an unexpected impact on the resulting dataset---including the introduction of spurious correlations that state-of-the-art models exploit. Overall, our results highlight a misalignment between the way we train our models and the task we actually expect them to solve, emphasizing the need for fine-grained evaluation techniques that go beyond average-case accuracy.