TVQA+: Spatio-Temporal Grounding for Video Question Answering

04/07/2020

Jie Lei, Licheng Yu, Tamara Berg, Mohit Bansal

Keywords: Spatio-Temporal Grounding, Video Answering, Spatio-Temporal Answering, Spatio-Temporal Evidence

Abstract: We present the task of Spatio-Temporal Video Question Answering, which requires intelligent systems to simultaneously retrieve relevant moments and detect referenced visual concepts (people and objects) to answer natural language questions about videos. We first augment the TVQA dataset with 310.8K bounding boxes, linking depicted objects to visual concepts in questions and answers. We call this augmented version TVQA+. We then propose Spatio-Temporal Answerer with Grounded Evidence (STAGE), a unified framework that grounds evidence in both spatial and temporal domains to answer questions about videos. Comprehensive experiments and analyses demonstrate the effectiveness of our framework and show how the rich annotations in our TVQA+ dataset can contribute to the question answering task. Moreover, by performing this joint task, our model is able to produce insightful and interpretable spatio-temporal attention visualizations.

The talk and the corresponding paper were published at the ACL 2020 virtual conference.

Similar Papers

04/07/2020

Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA

Hyounghun Kim, Zineng Tang, Mohit Bansal

Keywords: Dense-Caption Matching, Temporal VideoQA, answering questions, frame problem

10:56

07/09/2020

Two-Stream Spatiotemporal Compositional Attention Network for VideoQA

Taiki Miyanishi, Takuya Maekawa, Motoaki Kawanabe

Keywords: video question answering

2:02

14/06/2020

Hierarchical Conditional Relation Networks for Video Question Answering

Thao Minh Le, Vuong Le, Svetha Venkatesh, Truyen Tran

Keywords: video question answering, visual question answering, conditional relation network, vision-language neural network

5:00

14/06/2020

Detecting Attended Visual Targets in Video

Eunji Chong, Yongxin Wang, Nataniel Ruiz, James M. Rehg

Keywords: attention, gaze, video, dataset, social scene understanding

1:01

30/11/2020

Transforming Multi-Concept Attention into Video Summarization

Yen-Ting Liu, Yu-Jhe Li, Yu-Chiang Frank Wang

7:07

19/08/2021

Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering

Long Hoang Dang, Thao Minh Le, Vuong Le, Truyen Tran

Keywords: Computer Vision, Language and Vision

14:06

16/11/2020

Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning

Zhiyuan Fang, Tejas Gokhale, Pratyay Banerjee, Chitta Baral, Yezhou Yang

Keywords: captioning, video understanding, video captioning, generating captions

12:02

02/02/2021

Temporal ROI Align for Video Object Recognition

Tao Gong, Kai Chen, Xinjiang Wang, Qi Chu, Feng Zhu, Dahua Lin, Nenghai Yu, Huamin Feng

14:29

30/11/2020

Audiovisual Transformer with Instance Attention for Audio-Visual Event Localization

Yan-Bo Lin, Yu-Chiang Frank Wang

5:44

02/02/2021

Arbitrary Video Style Transfer via Multi-Channel Correlation

Yingying Deng, Fan Tang, Weiming Dong, Haibin Huang, Chongyang Ma, Changsheng Xu

14:55

14/06/2020

Violin: A Large-Scale Dataset for Video-and-Language Inference

Jingzhou Liu, Wenhu Chen, Yu Cheng, Zhe Gan, Licheng Yu, Yiming Yang, Jingjing Liu

Keywords: multimodal understanding, multimodal inference, video understanding

1:01

07/09/2020

Attention Distillation for Learning Video Representations

Miao Liu, Xin Chen, Yun Zhang, Yin Li, James Rehg

Keywords: Action Recognition, Deep Learning, Representation Learning

9:50

06/12/2021

MERLOT: Multimodal Neural Script Knowledge Models

Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi

Keywords: representation learning

18:15

14/06/2020

Syntax-Aware Action Targeting for Video Captioning

Qi Zheng, Chaoyue Wang, Dacheng Tao

Keywords: video and language, video captioning, action predicting

1:01

22/11/2021

Revisiting spatio-temporal layouts for compositional action recognition

Gorjan Radevski, Marie-Francine Moens, Tinne Tuytelaars

Keywords: compositional action recognition, video understanding, something-something, action genome, charades, video transformer, multimodal fusion, spatial reasoning, spatio-temporal action recognition, revisiting spatio-temporal layouts

9:58

05/01/2021

Coarse Temporal Attention Network (CTA-Net) for Driver's Activity Recognition

Zachary Wharton, Ardhendu Behera, Yonghuai Liu, Nik Bessis

5:30

16/11/2020

Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding

Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, Jason Baldridge

Keywords: multitask learning, embodied agents, vln, rxr

13:05

05/01/2021

The IKEA ASM Dataset: Understanding People Assembling Furniture Through Actions, Objects and Pose

Yizhak Ben-Shabat, Xin Yu, Fatemeh Saleh, Dylan Campbell, Cristian Rodriguez-Opazo, Hongdong Li, Stephen Gould

3:58

14/06/2020

Video Super-Resolution With Temporal Group Attention

Takashi Isobe, Songjiang Li, Xu Jia, Shanxin Yuan, Gregory Slabaugh, Chunjing Xu, Ya-Li Li, Shengjin Wang, Qi Tian

Keywords: video processing, video super-resolution

1:00

22/11/2021

CTRN: Class-Temporal Relational Network for Action Detection

Rui Dai, Srijan Das, Francois Bremond

Keywords: action detection, graph reasoning, graph convolutional network, temporal modelling, multi-label classification

7:02

22/11/2021

Single-Modal Entropy based Active Learning for Visual Question Answering

Dong-Jin Kim, Jae Won Cho, Jinsoo Choi, Yunjae Jung, In So Kweon

Keywords: Visual Question Answering, Vision and Language, Active Learning

2:42

22/11/2021

Paying Attention to Varying Receptive Fields: Object Detection with Atrous Filters and Vision Transformers

Arthur Jian Shun Lam, Jun Yi Lim, Ricky Sutopo, Vishnu Monn Baskaran

Keywords: object detection, atrous convolution, vision transformers, attention mechanism

3:01

14/06/2020

ActBERT: Learning Global-Local Video-Text Representations

Linchao Zhu, Yi Yang

Keywords: actbert, cross-modal pretraining, video and language, transformer, tangled transformer, instructional videos

4:58

14/06/2020

Multi-Granularity Reference-Aided Attentive Feature Aggregation for Video-Based Person Re-Identification

Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Zhibo Chen

Keywords: multi-granularity attention, video person re-identification, attentive feature aggregation, reference-aided attention, feature relations

1:01

07/09/2020

Mid-level Fusion for End-to-End Temporal Activity Detection in Untrimmed Video

Md Atiqur Rahman, Robert Laganiere

Keywords: temporal activity detection, action detection, untrimmed video processing, single-stage detection

9:46

06/12/2021

SIMONe: View-Invariant, Temporally-Abstracted Object Representations via Unsupervised Video Decomposition

Rishabh Kabra, Daniel Zoran, Goker Erdogan, Loic Matthey, Antonia Creswell, Matt Botvinick, Alexander Lerchner, Chris Burgess

Keywords: self-supervised learning

14:42

02/02/2021

Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers

Shijie Geng, Peng Gao, Moitreya Chatterjee, Chiori Hori, Jonathan Le Roux, Yongfeng Zhang, Hongsheng Li, Anoop Cherian

19:36

02/02/2021

CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation

Yang Fu, Linjie Yang, Ding Liu, Thomas S. Huang, Humphrey Shi

16:24

16/11/2020

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, Jingjing Liu

Keywords: large-scale learning, pre-training tasks, video-subtitle matching, text-based retrieval

11:47

30/11/2020

Rotation Axis Focused Attention Network (RAFA-Net) for Estimating Head Pose

Ardhendu Behera, Zachary Wharton, Pradeep Hewage, Swagat Kumar

10:19

06/12/2021

End-to-end Multi-modal Video Temporal Grounding

Yi-Wen Chen, Yi-Hsuan Tsai, Ming-Hsuan Yang

Keywords: self-supervised learning, transformers, vision, contrastive learning

8:46

06/12/2020

RANet: Region Attention Network for Semantic Segmentation

Dingguo Shen, Yuanfeng Ji, Ping Li, Yi Wang, Di Lin

3:13

14/06/2020

EmotiCon: Context-Aware Multimodal Emotion Recognition Using Frege’s Principle

Trisha Mittal, Pooja Guhan, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, Dinesh Manocha

Keywords: affective computing, perceived emotions, context understanding, multimodal, inter-agent interactions, depth maps, deep learning, background, attention maps

1:00

26/04/2020

Theory and Evaluation Metrics for Learning Disentangled Representations

Kien Do, Truyen Tran

Keywords: disentanglement, metrics

3:37

16/11/2020

What is More Likely to Happen Next? Video-and-Language Future Event Prediction

Jie Lei, Licheng Yu, Tamara Berg, Mohit Bansal

Keywords: video-and-language prediction, ai models, vlep, adversarial procedure

11:58

02/02/2021

Activity Image-to-Video Retrieval by Disentangling Appearance and Motion

Liu Liu, Jiangtong Li, Li Niu, Ruicong Xu, Liqing Zhang

14:34

05/01/2021

DORi: Discovering Object Relationships for Moment Localization of a Natural Language Query in a Video

Cristian Rodriguez-Opazo, Edison Marrese-Taylor, Basura Fernando, Hongdong Li, Stephen Gould

5:02

05/01/2021

Interpretable and Trustworthy Deepfake Detection via Dynamic Prototypes

Loc Trinh, Michael Tsang, Sirisha Rambhatla, Yan Liu

5:00

14/06/2020

Image Search With Text Feedback by Visiolinguistic Attention Learning

Yanbei Chen, Shaogang Gong, Loris Bazzani

Keywords: vision and language, image search, text feedback, attention mechanism, transformer, multimodal learning, representation learning, composition, image retrieval, interactive image search

1:00

14/06/2020

Screencast Tutorial Video Understanding

Kunpeng Li, Chen Fang, Zhaowen Wang, Seokhwan Kim, Hailin Jin, Yun Fu

Keywords: screencast tutorials, video understanding, text-to-video retrieval, tutorial video captioning

1:01