End-to-end Multi-modal Video Temporal Grounding

06/12/2021

End-to-end Multi-modal Video Temporal Grounding

Yi-Wen Chen, Yi-Hsuan Tsai, Ming-Hsuan Yang

Keywords: self-supervised learning, transformers, vision, contrastive learning

Abstract Paper Similar Papers

Abstract: We address the problem of text-guided video temporal grounding, which aims to identify the time interval of a certain event based on a natural language description. Different from most existing methods that only consider RGB images as visual features, we propose a multi-modal framework to extract complementary information from videos. Specifically, we adopt RGB images for appearance, optical flow for motion, and depth maps for image structure. While RGB images provide abundant visual cues of certain events, the performance may be affected by background clutters. Therefore, we use optical flow to focus on large motion and depth maps to infer the scene configuration when the action is related to objects recognizable with their shapes. To integrate the three modalities more effectively and enable inter-modal learning, we design a dynamic fusion scheme with transformers to model the interactions between modalities. Furthermore, we apply intra-modal self-supervised learning to enhance feature representations across videos for each modality, which also facilitates multi-modal learning. We conduct extensive experiments on the Charades-STA and ActivityNet Captions datasets, and show that the proposed method performs favorably against state-of-the-art approaches.

0

0

0

0

Share

This is an embedded video. Talk and the respective paper are published at NeurIPS 2021 virtual conference. If you are one of the authors of the paper and want to manage your upload, see the question "My papertalk has been externally embedded..." in the FAQ section.

Comments

Post Comment

no comments yet

Similar Papers

06/12/2021

Contrast and Mix: Temporal Contrastive Video Domain Adaptation with Background Mixing

Aadarsh Sahoo, Rutav Shah, Rameswar Panda and
Kate Saenko, Abir Das

Keywords Paper

domain adaptation, contrastive learning

0

0

0

0

13:20

22/11/2021

Hierarchical Interaction Network for Video Object Segmentation from Referring Expressions

Zhao Yang, Yansong Tang, Luca Bertinetto and
Hengshuang Zhao, Philip Torr

Keywords Paper

segmentation, video object segmentation, referring segmentation, referring video object segmentation, video object segmentation from referring expressions, referring image segmentation, referring image comprehension, optical flow, visual grounding

0

0

0

0

2:57

22/11/2021

GTA: Global Temporal Attention for Video Action Understanding

Bo He, Xitong Yang, Zuxuan Wu and
Hao Chen, Ser-Nam Lim, Abhinav Shrivastava

Keywords Paper

action recognition, self-attention, temporal modeling

0

0

0

0

2:55

14/06/2020

Image Search With Text Feedback by Visiolinguistic Attention Learning

Yanbei Chen, Shaogang Gong, Loris Bazzani

Keywords Paper

vision and language, image search, text feedback, attention mechanism, transformer, multimodal learning, representation learning, composition, image retrieval, interactive image search

0

0

0

0

1:00

02/02/2021

Arbitrary Video Style Transfer via Multi-Channel Correlation

Yingying Deng, Fan Tang, Weiming Dong and
Haibin Huang, Chongyang Ma, Changsheng Xu

Keywords Paper

0

0

0

0

14:55

14/06/2020

Visual-Textual Capsule Routing for Text-Based Video Segmentation

Bruce McIntosh, Kevin Duarte, Yogesh S Rawat, Mubarak Shah

Keywords Paper

segmentation, localization, video, capsule, natural language, action, a2d, routing

0

0

0

0

4:58

14/06/2020

Learning Selective Self-Mutual Attention for RGB-D Saliency Detection

Nian Liu, Ni Zhang, Junwei Han

Keywords Paper

rgb-d saliency detection, middle fusion, self-attention, mutual-attention, non-local network, two-stream cnn

0

0

0

0

1:01

22/11/2021

LARNet: Latent Action Representation for Human Action Synthesis

Naman Biyani, Aayush Jung Bahadur Rana, Shruti Vyas, Yogesh Rawat

Keywords Paper

action synthesis, video synthesis, joint generative model, human action generation, end-to-end learning, conditional video generation

0

0

0

0

3:02

22/11/2021

Paying Attention to Varying Receptive Fields: Object Detection with Atrous Filters and Vision Transformers

Arthur Jian Shun Lam, Jun Yi Lim, Ricky Sutopo, Vishnu Monn Baskaran

Keywords Paper

object detection, atrous convolution, vision transformers, attention mechanism

0

0

0

0

3:01

14/06/2020

ActBERT: Learning Global-Local Video-Text Representations

Linchao Zhu, Yi Yang

Keywords Paper

actbert, cross-modal pretraining, video and language, transformer, tangled transformer, instructional videos

0

0

0

0

4:58

02/02/2021

Learning Visual Context for Group Activity Recognition

Hangjie Yuan, Dong Ni

Keywords Paper

0

0

0

0

16:54

14/06/2020

Cross-Modality Person Re-Identification With Shared-Specific Feature Transfer

Yan Lu, Yue Wu, Bin Liu and
Tianzhu Zhang, Baopu Li, Qi Chu, Nenghai Yu

Keywords Paper

person re-identification, cross modality

0

0

0

0

0:56

02/02/2021

Proposal-Free Video Grounding with Contextual Pyramid Network

Kun Li, Dan Guo, Meng Wang

Keywords Paper

0

0

0

0

14:19

14/06/2020

Syntax-Aware Action Targeting for Video Captioning

Qi Zheng, Chaoyue Wang, Dacheng Tao

Keywords Paper

video and language, video captioning, action predicting

0

0

0

0

1:01

05/01/2021

Coarse Temporal Attention Network (CTA-Net) for Driver's Activity Recognition

Zachary Wharton, Ardhendu Behera, Yonghuai Liu, Nik Bessis

Keywords Paper

0

0

0

0

5:30

05/01/2021

Cross-Domain Latent Modulation for Variational Transfer Learning

Jinyong Hou, Jeremiah D. Deng, Stephen Cranefield, Xuejie Ding

Keywords Paper

0

0

0

0

4:52

14/06/2020

Video Instance Segmentation Tracking With a Modified VAE Architecture

Chung-Ching Lin, Ying Hung, Rogerio Feris, Linglin He

Keywords Paper

video instance segmentation, video object tracking, variational autoencoder, vae, gaussian process, multi-task learning

0

0

0

0

1:01

14/06/2020

JL-DCF: Joint Learning and Densely-Cooperative Fusion Framework for RGB-D Salient Object Detection

Keren Fu, Deng-Ping Fan, Ge-Peng Ji, Qijun Zhao

Keywords Paper

visual saliency, salient object detection, rgb-d, depth information, joint learning, dense connections, multi-modal features, feature fusion, deep learning, encoder-decoder

0

0

0

0

1:01

14/06/2020

Action Modifiers: Learning From Adverbs in Instructional Videos

Hazel Doughty, Ivan Laptev, Walterio Mayol-Cuevas, Dima Damen

Keywords Paper

vision and language, video understanding, action recognition, action retrieval, instructional videos, weakly-supervised videos, action and behaviour, attributes, attention, adverbs

0

0

0

0

1:01

06/12/2020

Learning Semantic-aware Normalization for Generative Adversarial Networks

Heliang Zheng, Jianlong Fu, zengyh Zeng and
Jiebo Luo, Zheng-Jun Zha

Keywords Paper

0

0

0

0

3:11

06/12/2021

Look at What I’m Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos

Reuben Tan, Bryan Plummer, Kate Saenko and
Hailin Jin, Bryan Russell

Keywords Paper

optimization

0

0

0

0

12:28

06/12/2021

Shifted Chunk Transformer for Spatio-Temporal Representational Learning

Xuefan Zha, Wentao Zhu, Lv Xun and
Sen Yang, Ji Liu

Keywords Paper

machine learning, transformers, vision, language

0

0

0

0

6:14

22/11/2021

Hierarchical Contrastive Motion Learning for Video Action Recognition

Xitong Yang, Xiaodong Yang, Sifei Liu and
Deqing Sun, Larry Davis, Jan Kautz

Keywords Paper

action recognition, motion hierarchy, motion representation, contrastive learning

0

0

0

0

8:29

07/09/2020

Attention Distillation for Learning Video Representations

Miao Liu, Xin Chen, Yun Zhang and
Yin Li, James Rehg

Keywords Paper

Action Recognition, Deep Learning, Representation Learning

0

0

0

0

9:50

02/02/2021

Object Relation Attention for Image Paragraph Captioning

Li-Chuan Yang, Chih-Yuan Yang, Jane Yung-jen Hsu

Keywords Paper

0

0

0

0

15:03

06/12/2020

Learning Representations from Audio-Visual Spatial Alignment

Pedro Morgado, Yi Li, Nuno Nvasconcelos

Keywords Paper

0

0

0

0

3:21

30/11/2020

Image Captioning through Image Transformer

Sen He, Wentong Liao, Hamed R. Tavakoli and
Michael Yang, Bodo Rosenhahn, Nicolas Pugeault

Keywords Paper

0

0

0

0

9:49

02/02/2021

Bridging Towers of Multi-task Learning with a Gating Mechanism for Aspect-based Sentiment Analysis and Sequential Metaphor Identification

Rui Mao, Xiao Li

Keywords Paper

0

0

0

0

19:27

02/02/2021

CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation

Yang Fu, Linjie Yang, Ding Liu and
Thomas S. Huang, Humphrey Shi

Keywords Paper

0

0

0

0

16:24

06/12/2021

Associating Objects with Transformers for Video Object Segmentation

Zongxin Yang, Yunchao Wei, Yi Yang

Keywords Paper

transformers

0

0

0

0

12:29

02/02/2021

F2Net: Learning to Focus on the Foreground for Unsupervised Video Object Segmentation

Daizong Liu, Dongdong Yu, Changhu Wang, Pan Zhou

Keywords Paper

0

0

0

0

16:59

02/02/2021

Mind-the-Gap! Unsupervised Domain Adaptation for Text-Video Retrieval

Qingchao Chen, Yang Liu, Samuel Albanie

Keywords Paper

0

0

0

0

15:19

14/06/2020

Learning Fused Pixel and Feature-Based View Reconstructions for Light Fields

Jinglei Shi, Xiaoran Jiang, Christine Guillemot

Keywords Paper

light field, view synthesis, feature-based reconstruction, pixel-based reconstruction, deep learning, angular super-resolution

0

0

0

0

4:56

06/12/2021

Class-agnostic Reconstruction of Dynamic Objects from Videos

Zhongzheng Ren, Xiaoming Zhao, Alex Schwing

Keywords Paper

0

0

0

0

13:29

06/12/2021

Compressed Video Contrastive Learning

Yuqi Huo, Mingyu Ding, Haoyu Lu and
Nanyi Fei, Zhiwu Lu, Ji-Rong Wen, Ping Luo

Keywords Paper

self-supervised learning, contrastive learning, representation learning

0

0

0

0

9:07

14/06/2020

Multi-Granularity Reference-Aided Attentive Feature Aggregation for Video-Based Person Re-Identification

Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Zhibo Chen

Keywords Paper

multi-granularity attention, video person re-identification, attentive feature aggregation, reference-aided attention, feature relations

0

0

0

0

1:01

26/04/2020

Neural Outlier Rejection for Self-Supervised Keypoint Learning

Jiexiong Tang, Hanme Kim, Vitor Guizilini and
Sudeep Pillai, Rares Ambrus

Keywords Paper

Self-Supervised Learning, Keypoint Detection, Outlier Rejection, Deep Learning

0

0

0

0

4:55

03/05/2021

$i$-Mix: A Domain-Agnostic Strategy for Contrastive Representation Learning

Kibok Lee, Yian Zhu, Kihyuk Sohn and
Chun-Liang Li, Jinwoo Shin, Honglak Lee

Keywords Paper

self-supervised learning, unsupervised representation learning, data augmentation, MixUp, contrastive representation learning

0

0

0

0

5:04

30/11/2020

Mask-Ranking Network for Semi-Supervised Video Object Segmentation

Wenjing Li, Xiang Zhang, Yujie Hu, Yingqi Tang

Keywords Paper

0

0

0

0

5:36

22/11/2021

Segmenting Invisible Moving Objects

Hala Lamdouar, Weidi Xie, Andrew Zisserman

Keywords Paper

synthetic data generation, motion segmentation, amodal segmentation, video camouflage breaking, self-attention

0

0

0

0

3:05