19/08/2021

Dig into Multi-modal Cues for Video Retrieval with Hierarchical Alignment

Wenzhe Wang, Mengdan Zhang, Runnan Chen, Guanyu Cai, Penghao Zhou, Pai Peng, Xiaowei Guo, Jian Wu, Xing Sun

Keywords: Computer Vision, Language and Vision, Deep Learning

Abstract: Multi-modal cues present in videos are usually beneficial for the challenging task of video-text retrieval on internet-scale datasets. Recent video retrieval methods exploit multi-modal cues by aggregating them into holistic high-level semantics that are matched against text representations in a global view. In contrast to this global alignment, the local alignment between the detailed semantics encoded in individual multi-modal cues and distinct phrases remains underexplored. In this paper, we therefore leverage hierarchical video-text alignment to fully exploit the detailed, diverse characteristics of multi-modal cues for fine-grained alignment with the local semantics of phrases, while also capturing high-level semantic correspondence. Specifically, multi-step attention is learned for progressively more comprehensive local alignment, and a holistic transformer summarizes the multi-modal cues for global alignment. With hierarchical alignment, our model outperforms state-of-the-art methods on three public video retrieval datasets.
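The abstract describes two alignment paths: multi-step attention where phrases progressively attend to multi-modal cue features (local alignment), and a transformer that summarizes the cues into a single video representation matched against the sentence embedding (global alignment). The PyTorch sketch below illustrates one plausible reading of that design; it is not the authors' implementation, and every module name, dimension, and the cosine-similarity scoring scheme is an assumption made for illustration.

```python
# Illustrative sketch of hierarchical video-text alignment as described
# in the abstract. NOT the authors' code: module structure, dimensions,
# and the scoring scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAlignment(nn.Module):
    def __init__(self, dim=512, num_steps=2, num_heads=8):
        super().__init__()
        # Multi-step cross-attention: phrases attend to multi-modal cues,
        # refining the local alignment at each step.
        self.local_steps = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_steps)
        ])
        # Holistic transformer that summarizes all cue tokens into one
        # global video representation via a learned [CLS]-style token.
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.summarizer = nn.TransformerEncoder(layer, num_layers=2)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, phrases, cues, sentence):
        # phrases:  (B, P, D) phrase embeddings from a text encoder
        # cues:     (B, M, D) multi-modal cue tokens (e.g. appearance,
        #           motion, audio), projected to a shared dimension
        # sentence: (B, D)    global sentence embedding
        q = phrases
        for attn in self.local_steps:
            ctx, _ = attn(q, cues, cues)   # gather cue content per phrase
            q = q + ctx                    # progressive refinement
        # Local score: how well each phrase matches the cue content it
        # attends to, averaged over phrases.
        local_sim = F.cosine_similarity(ctx, phrases, dim=-1).mean(dim=1)

        # Global score: summarize cues with the transformer, then compare
        # the [CLS] summary against the sentence embedding.
        tokens = torch.cat(
            [self.cls.expand(cues.size(0), -1, -1), cues], dim=1)
        video_global = self.summarizer(tokens)[:, 0]
        global_sim = F.cosine_similarity(video_global, sentence, dim=-1)
        return local_sim + global_sim
```

Under this reading, the combined score for each video-text pair would feed a standard contrastive ranking loss during training; how the paper actually fuses the local and global scores is not specified in the abstract.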

Talk and paper published at the IJCAI 2021 virtual conference.

