07/09/2020

Two-Stream Spatiotemporal Compositional Attention Network for VideoQA

Taiki Miyanishi, Takuya Maekawa, Motoaki Kawanabe

Keywords: video question answering

Abstract: This study tackles video question answering (VideoQA), which requires spatiotemporal video reasoning. VideoQA aims to return an appropriate answer to a textual question by referring to image frames in the video. In this paper, based on the observation that multiple entities and their movements in the video can be important clues for deriving the correct answer, we propose a two-stream spatiotemporal compositional attention network that achieves sophisticated multi-step spatiotemporal reasoning by using both motion and detailed appearance features. In contrast to existing video reasoning approaches that use frame-level or clip-level appearance and motion features, our method simultaneously attends to detailed appearance features of multiple entities as well as motion features, guided by the attended words in the textual question. Furthermore, it progressively refines its internal representation and infers the answer via multiple reasoning steps. We evaluate our method on short- and long-form VideoQA benchmarks (MSVD-QA, MSRVTT-QA, and ActivityNet-QA) and achieve state-of-the-art accuracy on these datasets.
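The abstract describes question-guided attention over two streams (entity-level appearance and clip-level motion) with iterative refinement, but gives no implementation details. The PyTorch-style sketch below illustrates one plausible reasoning step under stated assumptions: the module name TwoStreamAttentionStep, the linear attention scorers, the GRU-cell state update, and all tensor shapes are hypothetical choices for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamAttentionStep(nn.Module):
    """One question-guided reasoning step over appearance and motion streams.

    Hypothetical sketch: the projections, gating, and shapes are assumptions,
    not the paper's exact architecture.
    """

    def __init__(self, dim):
        super().__init__()
        self.word_attn = nn.Linear(dim, 1)      # scores question words
        self.app_attn = nn.Linear(dim, 1)       # scores entity (appearance) features
        self.mot_attn = nn.Linear(dim, 1)       # scores clip (motion) features
        self.update = nn.GRUCell(2 * dim, dim)  # refines the reasoning state

    def forward(self, state, words, appearance, motion):
        # state: (B, D); words: (B, Lw, D)
        # appearance: (B, T, K, D) entity features per frame; motion: (B, Tc, D)
        B, T, K, D = appearance.shape

        # 1) Attend to question words, conditioned on the current reasoning state.
        w = F.softmax(self.word_attn(words * state.unsqueeze(1)), dim=1)
        q = (w * words).sum(dim=1)                                # (B, D)

        # 2) Question-guided attention over detailed entity appearance features.
        app = appearance.reshape(B, T * K, D)
        a = F.softmax(self.app_attn(app * q.unsqueeze(1)), dim=1)
        app_ctx = (a * app).sum(dim=1)                            # (B, D)

        # 3) Question-guided attention over clip-level motion features.
        m = F.softmax(self.mot_attn(motion * q.unsqueeze(1)), dim=1)
        mot_ctx = (m * motion).sum(dim=1)                         # (B, D)

        # 4) Fuse both streams and refine the internal representation.
        return self.update(torch.cat([app_ctx, mot_ctx], dim=-1), state)


if __name__ == "__main__":
    B, D = 2, 64
    step = TwoStreamAttentionStep(D)
    state = torch.zeros(B, D)
    words = torch.randn(B, 10, D)          # encoded question words
    appearance = torch.randn(B, 8, 5, D)   # 8 frames x 5 detected entities
    motion = torch.randn(B, 4, D)          # 4 clips of motion features
    for _ in range(3):                     # multi-step compositional reasoning
        state = step(state, words, appearance, motion)
    print(state.shape)                     # torch.Size([2, 64])
```

Running the step several times, as in the demo loop, mirrors the multi-step refinement described in the abstract: each pass re-attends to words and video features conditioned on the evolving state before a final answer classifier (not shown) would read it out.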

[Embedded video: talk presented at the BMVC 2020 virtual conference.]
