Local-Global Video-Text Interactions for Temporal Grounding

14/06/2020

Jonghwan Mun, Minsu Cho, Bohyung Han

Keywords: temporal grounding, temporal moment retrieval, localization by natural language, video understanding, vision and language

Abstract: This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video that is semantically relevant to a text query. We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query, which correspond to important semantic entities described in the query (e.g., actors, objects, and actions), and reflects bi-modal interactions between the linguistic features of the query and the visual features of the video at multiple levels. The proposed method effectively predicts the target time interval by exploiting contextual information from local to global during bi-modal interactions. Through in-depth ablation studies, we find that incorporating both local and global context in video and text interactions is crucial to accurate grounding. Our experiments show that the proposed method outperforms the state of the art on the Charades-STA and ActivityNet Captions datasets by large margins, 7.44 and 4.61 percentage points on the Recall@tIoU=0.5 metric, respectively.
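
The Recall@tIoU=0.5 metric quoted in the abstract can be illustrated with a minimal sketch. It assumes one top-ranked predicted interval per query and intervals given as (start, end) in seconds; the function names `temporal_iou` and `recall_at_tiou` are hypothetical and are not taken from the authors' code, whose exact evaluation protocol is not described on this page.

```python
from typing import List, Tuple

# (start, end) of a moment in seconds; this representation is an assumption.
Interval = Tuple[float, float]


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Temporal intersection over union (tIoU) of two time intervals."""
    # Length of the overlap, clamped at zero for disjoint intervals.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union length = sum of both lengths minus the overlap.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_tiou(
    preds: List[Interval], gts: List[Interval], threshold: float = 0.5
) -> float:
    """Fraction of queries whose prediction reaches tIoU >= threshold.

    Assumes preds[i] is the single top-ranked prediction for query i.
    """
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```

Under these assumptions, a prediction of (2.0, 8.0) against a ground truth of (0.0, 10.0) has a tIoU of 0.6 and counts as a hit at the 0.5 threshold, while (0.0, 10.0) against (5.0, 15.0) has a tIoU of 1/3 and does not.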

The talk and the corresponding paper were published at the CVPR 2020 virtual conference.

Similar Papers

02/02/2021

Semantic Grouping Network for Video Captioning

Hobin Ryu, Sunghun Kang, Haeyong Kang, Chang D. Yoo

17:41

05/01/2021

DORi: Discovering Object Relationships for Moment Localization of a Natural Language Query in a Video

Cristian Rodriguez-Opazo, Edison Marrese-Taylor, Basura Fernando, Hongdong Li, Stephen Gould

5:02

25/07/2020

3D self-attention for unsupervised video quantization

Jingkuan Song, Ruimin Lang, Xiaosu Zhu, Xing Xu, Lianli Gao, Heng Tao Shen

Keywords: quantization, video retrieval, ann search

9:44

14/06/2020

ActBERT: Learning Global-Local Video-Text Representations

Linchao Zhu, Yi Yang

Keywords: actbert, cross-modal pretraining, video and language, transformer, tangled transformer, instructional videos

4:58

02/02/2021

Spatial-temporal Causal Inference for Partial Image-to-video Adaptation

Jin Chen, Xinxiao Wu, Yao Hu, Jiebo Luo

20:01

19/08/2021

Dig into Multi-modal Cues for Video Retrieval with Hierarchical Alignment

Wenzhe Wang, Mengdan Zhang, Runnan Chen, Guanyu Cai, Penghao Zhou, Pai Peng, Xiaowei Guo, Jian Wu, Xing Sun

Keywords: Computer Vision, Language and Vision, Deep Learning

9:07

06/12/2020

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, Thomas Brox

3:16

02/02/2021

Proposal-Free Video Grounding with Contextual Pyramid Network

Kun Li, Dan Guo, Meng Wang

14:19

14/06/2020

Video Playback Rate Perception for Self-Supervised Spatio-Temporal Representation Learning

Yuan Yao, Chang Liu, Dezhao Luo, Yu Zhou, Qixiang Ye

Keywords: self-supervised spatio-temporal representation learning, multi-temporal resolution characteristic, playback rate perception, motion attention mechanism

1:01

02/02/2021

Mind-the-Gap! Unsupervised Domain Adaptation for Text-Video Retrieval

Qingchao Chen, Yang Liu, Samuel Albanie

15:19

05/01/2021

Temporal Context Aggregation for Video Retrieval With Contrastive Learning

Jie Shao, Xin Wen, Bingchen Zhao, Xiangyang Xue

4:50

19/08/2021

Text-based Person Search via Multi-Granularity Embedding Learning

Chengji Wang, Zhiming Luo, Yaojin Lin, Shaozi Li

Keywords: Computer Vision, Language and Vision, Recognition

12:25

05/01/2021

Improving Video Captioning With Temporal Composition of a Visual-Syntactic Embedding

Jesus Perez-Martin, Benjamin Bustos, Jorge Perez

5:01

14/06/2020

Modality Shifting Attention Network for Multi-Modal Video Question Answering

Junyeong Kim, Minuk Ma, Trung Pham, Kyungsu Kim, Chang D. Yoo

Keywords: multi-modal video question answering, visual reasoning, vision-language interaction, computer vision

1:01

07/09/2020

Refinement of Boundary Regression Using Uncertainty in Temporal Action Localization

Yunze Chen, Mengjuan Chen, Rui Wu, Jiagang Zhu, Zheng Zhu, Qingyi Gu

Keywords: Temporal Action Localization, Temporal Action Detection, Activity recognition and understanding

5:09

14/06/2020

Visual-Textual Capsule Routing for Text-Based Video Segmentation

Bruce McIntosh, Kevin Duarte, Yogesh S Rawat, Mubarak Shah

Keywords: segmentation, localization, video, capsule, natural language, action, a2d, routing

4:58

05/01/2021

Unsupervised Video Representation Learning by Bidirectional Feature Prediction

Nadine Behrmann, Jurgen Gall, Mehdi Noroozi

4:57

05/01/2021

Alleviating Over-Segmentation Errors by Detecting Action Boundaries

Yuchi Ishikawa, Seito Kasai, Yoshimitsu Aoki, Hirokatsu Kataoka

4:48

06/12/2021

Contrast and Mix: Temporal Contrastive Video Domain Adaptation with Background Mixing

Aadarsh Sahoo, Rutav Shah, Rameswar Panda, Kate Saenko, Abir Das

Keywords: domain adaptation, contrastive learning

13:20

14/06/2020

Context-Aware Attention Network for Image-Text Retrieval

Qi Zhang, Zhen Lei, Zhaoxiang Zhang, Stan Z. Li

Keywords: image-text retrieval, multimodal, attention

1:01

19/08/2021

Step-Wise Hierarchical Alignment Network for Image-Text Matching

Zhong Ji, Kexin Chen, Haoran Wang

Keywords: Computer Vision, Language and Vision

6:07

02/02/2021

Query-Memory Re-Aggregation for Weakly-supervised Video Object Segmentation

Fanchao Lin, Hongtao Xie, Yan Li, Yongdong Zhang

14:19

01/07/2020

On Incorporating Structural Information to improve Dialogue Response Generation

Nikita Moghe, Priyesh Vijayan, Balaraman Ravindran, Mitesh M. Khapra

13:00

30/11/2020

Show, Conceive and Tell: Image Captioning with Prospective Linguistic Information

Yiqing Huang, Jiansheng Chen

7:08

14/06/2020

IMRAM: Iterative Matching With Recurrent Attention Memory for Cross-Modal Image-Text Retrieval

Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, Jungong Han

Keywords: cross-modal image text retrieval, iterative matching, recurrent attention memory

1:04

18/07/2021

Decomposed Mutual Information Estimation for Contrastive Representation Learning

Alessandro Sordoni, Nouha Dziri, Hannes Schulz, Geoff Gordon, Philip Bachman, Remi Tachet des Combes

Keywords: Algorithms, Unsupervised Learning

4:57

07/09/2020

MDA-Net: Memorable Domain Adaptation Network for Monocular Depth Estimation

Jing Zhu, Yunxiao Shi, Mengwei Ren, Yi Fang

Keywords: depth estimation, LSTM, autonomous driving, visual perception

5:59

02/02/2021

Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context

Ziyi Liu, Le Wang, Wei Tang, Junsong Yuan, Nanning Zheng, Gang Hua

19:49

14/06/2020

Dense Regression Network for Video Grounding

Runhao Zeng, Haoming Xu, Wenbing Huang, Peihao Chen, Mingkui Tan, Chuang Gan

Keywords: video grounding, sparse annotations, dense regression, multi-level fusion

0:57

04/07/2020

Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA

Hyounghun Kim, Zineng Tang, Mohit Bansal

Keywords: Dense-Caption Matching, Temporal VideoQA, answering questions, frame problem

10:56

03/05/2021

Support-set bottlenecks for video-text representation learning

Mandela Patrick, Po-Yao Huang, Yuki Asano, Florian Metze, Alexander G Hauptmann, Joao F. Henriques, Andrea Vedaldi

Keywords: contrastive learning, video-text learning, multi-modal learning, video representation learning

6:40

22/11/2021

Hierarchical Interaction Network for Video Object Segmentation from Referring Expressions

Zhao Yang, Yansong Tang, Luca Bertinetto, Hengshuang Zhao, Philip Torr

Keywords: segmentation, video object segmentation, referring segmentation, referring video object segmentation, video object segmentation from referring expressions, referring image segmentation, referring image comprehension, optical flow, visual grounding

2:57

16/11/2020

MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding

Qinxin Wang, Hao Tan, Sheng Shen, Michael Mahoney, Zhewei Yao

Keywords: phrase localization, visually-aware representations, weakly-supervised scenarios, ablation studies

6:59

19/08/2021

Dependent Multi-Task Learning with Causal Intervention for Image Captioning

Wenqing Chen, Jidong Tian, Caoyun Fan, Hao He, Yaohui Jin

Keywords: Machine Learning, Transfer, Adaptation, Multi-task Learning, Natural Language Generation, Language and Vision

12:02

19/08/2021

Learning Implicit Temporal Alignment for Few-shot Video Classification

Songyang Zhang, Jiale Zhou, Xuming He

Keywords: Computer Vision, Action Recognition, Deep Learning

6:20

22/11/2021

Space-Time Memory Network for Sounding Object Localization in Videos

Sizhe Li, Yapeng Tian, Chenliang Xu

Keywords: Sounding object Localization, Space-Time Memory Network, Audio-Visual

2:57

07/09/2020

Two-Stream Spatiotemporal Compositional Attention Network for VideoQA

Taiki Miyanishi, Takuya Maekawa, Motoaki Kawanabe

Keywords: video question answering

2:02

06/12/2021

Contextual Similarity Aggregation with Self-attention for Visual Re-ranking

Jianbo Ouyang, Hui Wu, Min Wang, Wengang Zhou, Houqiang Li

Keywords: robustness, transformers

6:34

26/04/2020

Few-shot Text Classification with Distributional Signatures

Yujia Bao, Menghua Wu, Shiyu Chang, Regina Barzilay

Keywords: text classification, meta learning, few shot learning

4:44

06/12/2021

Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing

Yan-Bo Lin, Hung-Yu Tseng, Hsin-Ying Lee, Yen-Yu Lin, Ming-Hsuan Yang

14:06