07/09/2020

Two-Stream Spatiotemporal Compositional Attention Network for VideoQA

Taiki Miyanishi, Takuya Maekawa, Motoaki Kawanabe

Keywords: video question answering

Abstract: This study tackles video question answering (VideoQA), which requires spatiotemporal video reasoning. VideoQA aims to return an appropriate answer to a textual question by referring to image frames in the video. In this paper, based on the observation that multiple entities and their movements in the video can be important clues for deriving the correct answer, we propose a two-stream spatiotemporal compositional attention network that achieves sophisticated multi-step spatiotemporal reasoning by using both motion and detailed appearance features. In contrast to existing video reasoning approaches that use frame-level or clip-level appearance and motion features, our method simultaneously attends to detailed appearance features of multiple entities as well as motion features, guided by the attended words in the textual question. Furthermore, it progressively refines its internal representation and infers the answer via multiple reasoning steps. We evaluate our method on short- and long-form VideoQA benchmarks (MSVD-QA, MSRVTT-QA, and ActivityNet-QA) and achieve state-of-the-art accuracy on these datasets.
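The abstract describes question-guided attention over two streams (entity-level appearance and clip-level motion) with iterative refinement, but gives no implementation details. The PyTorch-style sketch below illustrates one plausible reasoning step under stated assumptions: the module name TwoStreamAttentionStep, the linear attention scorers, the GRU-cell state update, and all tensor shapes are hypothetical choices for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamAttentionStep(nn.Module):
    """One question-guided reasoning step over appearance and motion streams.

    Hypothetical sketch: the projections, gating, and shapes are assumptions,
    not the paper's exact architecture.
    """

    def __init__(self, dim):
        super().__init__()
        self.word_attn = nn.Linear(dim, 1)      # scores question words
        self.app_attn = nn.Linear(dim, 1)       # scores entity (appearance) features
        self.mot_attn = nn.Linear(dim, 1)       # scores clip (motion) features
        self.update = nn.GRUCell(2 * dim, dim)  # refines the reasoning state

    def forward(self, state, words, appearance, motion):
        # state: (B, D); words: (B, Lw, D)
        # appearance: (B, T, K, D) entity features per frame; motion: (B, Tc, D)
        B, T, K, D = appearance.shape

        # 1) Attend to question words, conditioned on the current reasoning state.
        w = F.softmax(self.word_attn(words * state.unsqueeze(1)), dim=1)
        q = (w * words).sum(dim=1)                                # (B, D)

        # 2) Question-guided attention over detailed entity appearance features.
        app = appearance.reshape(B, T * K, D)
        a = F.softmax(self.app_attn(app * q.unsqueeze(1)), dim=1)
        app_ctx = (a * app).sum(dim=1)                            # (B, D)

        # 3) Question-guided attention over clip-level motion features.
        m = F.softmax(self.mot_attn(motion * q.unsqueeze(1)), dim=1)
        mot_ctx = (m * motion).sum(dim=1)                         # (B, D)

        # 4) Fuse both streams and refine the internal representation.
        return self.update(torch.cat([app_ctx, mot_ctx], dim=-1), state)


if __name__ == "__main__":
    B, D = 2, 64
    step = TwoStreamAttentionStep(D)
    state = torch.zeros(B, D)
    words = torch.randn(B, 10, D)          # encoded question words
    appearance = torch.randn(B, 8, 5, D)   # 8 frames x 5 detected entities
    motion = torch.randn(B, 4, D)          # 4 clips of motion features
    for _ in range(3):                     # multi-step compositional reasoning
        state = step(state, words, appearance, motion)
    print(state.shape)                     # torch.Size([2, 64])
```

Running the step several times, as in the demo loop, mirrors the multi-step refinement described in the abstract: each pass re-attends to words and video features conditioned on the evolving state before a final answer classifier (not shown) would read it out.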

[Embedded video: talk presented at the BMVC 2020 virtual conference.]
