25/07/2020

Multi-level multimodal transformer network for multimodal recipe comprehension

Ao Liu, Shuai Yuan, Chenbin Zhang, Congjian Luo, Yaqing Liao, Kun Bai, Zenglin Xu

Keywords: multimodal machine reading comprehension, multimodal recipe comprehension, question answering

Abstract: Multimodal Machine Comprehension (M^3C) is a challenging task that requires understanding both language and vision, as well as their integration and interaction. For example, the RecipeQA challenge, which provides several M^3C tasks, requires deep neural models to understand textual instructions and images of the individual steps, as well as the logical order of the cooking process. To address this challenge, we propose a Multi-Level Multi-Modal Transformer (MLMM-Trans) framework to integrate and understand multiple textual instructions and multiple images. Our model applies an intensive attention mechanism at multiple levels of objects (e.g., the step level and the passage-image level) to sequences of different modalities. Experiments show that our model achieves state-of-the-art results on the three multimodal tasks of RecipeQA.
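The abstract describes attention applied at two levels: a step level (within each textual instruction or image) and a passage-image level (across the two sequences). The PyTorch sketch below illustrates that general idea only; the module names, dimensions, pooling choice, and wiring are assumptions for illustration, not the authors' released MLMM-Trans implementation.

```python
# A minimal sketch of two-level multimodal attention, assuming PyTorch.
# All names and hyperparameters are illustrative, not the paper's actual model.
import torch
import torch.nn as nn


class StepLevelEncoder(nn.Module):
    """Encodes the tokens of one recipe step (or the region features of one
    image) with self-attention, then mean-pools them into a single vector."""

    def __init__(self, d_model: int = 256, nhead: int = 4, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> (batch, d_model)
        return self.encoder(x).mean(dim=1)


class PassageImageLevelFusion(nn.Module):
    """Fuses the sequence of step vectors with the sequence of image vectors
    via cross-attention in both directions."""

    def __init__(self, d_model: int = 256, nhead: int = 4):
        super().__init__()
        self.text_to_image = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, text_steps: torch.Tensor, image_steps: torch.Tensor):
        # text_steps: (batch, num_steps, d_model); image_steps: (batch, num_images, d_model)
        fused_text, _ = self.text_to_image(text_steps, image_steps, image_steps)
        fused_image, _ = self.image_to_text(image_steps, text_steps, text_steps)
        return fused_text, fused_image


if __name__ == "__main__":
    batch, num_steps, num_images, tokens, regions, d = 2, 5, 4, 12, 9, 256
    step_encoder = StepLevelEncoder(d)   # shared across modalities here for brevity
    fusion = PassageImageLevelFusion(d)

    # Random stand-ins for token embeddings per step and region features per image.
    step_tokens = torch.randn(batch * num_steps, tokens, d)
    image_regions = torch.randn(batch * num_images, regions, d)

    # Step level: one vector per textual step / per image.
    text_vecs = step_encoder(step_tokens).view(batch, num_steps, d)
    image_vecs = step_encoder(image_regions).view(batch, num_images, d)

    # Passage-image level: cross-modal attention over the two sequences.
    fused_text, fused_image = fusion(text_vecs, image_vecs)
    print(fused_text.shape, fused_image.shape)  # (2, 5, 256) (2, 4, 256)
```

The fused representations would then feed a task-specific head (e.g., answer scoring for the RecipeQA tasks); that part is omitted here.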

The video of this talk cannot be embedded. You can watch it here:
https://dl.acm.org/doi/10.1145/3397271.3401247#sec-supp
The talk and the respective paper were published at the SIGIR 2020 virtual conference.

