Abstract:
This work proposes a new challenge set for multimodal classification, focusing on
detecting hate speech in multimodal memes. It is constructed such that unimodal
models struggle and only multimodal models can succeed: difficult examples
(“benign confounders”) are added to the dataset to make it hard to rely on unimodal
signals. The task requires subtle reasoning, yet is straightforward to evaluate
as a binary classification problem. We provide baseline performance numbers
for unimodal models, as well as for multimodal models with various degrees of
sophistication. We find that state-of-the-art methods perform poorly compared to
humans, illustrating the difficulty of the task and highlighting the challenge that this important problem poses to the community.