COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

06/12/2020

Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, Thomas Brox

Abstract: Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative Hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip); a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph); and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters.
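
The abstract names a cross-modal cycle-consistency loss but does not spell out its form. The following is a minimal sketch of one common way such a loss can be written, assuming a soft nearest-neighbor cycle from clip embeddings to sentence embeddings and back; it is not taken from the paper. The function names (`soft_nearest_neighbor`, `cycle_consistency_loss`) and the choice of squared Euclidean distance are hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def soft_nearest_neighbor(query, candidates):
    """Similarity-weighted average of candidates (a 'soft' nearest neighbor)."""
    weights = softmax([-sq_dist(query, c) for c in candidates])
    dim = len(candidates[0])
    return [sum(w * c[d] for w, c in zip(weights, candidates)) for d in range(dim)]


def cycle_consistency_loss(clips, sentences):
    """Mean squared gap between each clip index and the soft index it cycles back to.

    Assumed formulation (hypothetical, for illustration only):
    clip -> soft nearest sentence -> soft location in the clip sequence.
    """
    total = 0.0
    for i, clip in enumerate(clips):
        # Step 1: map the clip to its soft nearest neighbor among the sentences.
        soft_sentence = soft_nearest_neighbor(clip, sentences)
        # Step 2: cycle back and compute a soft position over the clip sequence.
        back = softmax([-sq_dist(soft_sentence, c) for c in clips])
        soft_index = sum(w * k for k, w in enumerate(back))
        total += (i - soft_index) ** 2
    return total / len(clips)


# Toy embeddings: three well-separated clips with matching sentence embeddings.
clips = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
matched_loss = cycle_consistency_loss(clips, clips)  # near zero: every cycle returns home

# Degenerate text side: all sentences collapse to one point, so cycles cannot return home.
collapsed_loss = cycle_consistency_loss(clips, [[3.0, 3.0]] * 3)
print(matched_loss, collapsed_loss)
```

In this sketch the loss is small when each clip can be recovered after a round trip through the text embeddings, and grows when the text side carries no information about which clip it came from.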

The talk and the respective paper were published at the NeurIPS 2020 virtual conference.

Similar Papers

14/06/2020

ActBERT: Learning Global-Local Video-Text Representations

Linchao Zhu, Yi Yang

Keywords: actbert, cross-modal pretraining, video and language, transformer, tangled transformer, instructional videos

4:58

02/02/2021

Semantic Grouping Network for Video Captioning

Hobin Ryu, Sunghun Kang, Haeyong Kang, Chang D. Yoo

17:41

19/08/2021

Dig into Multi-modal Cues for Video Retrieval with Hierarchical Alignment

Wenzhe Wang, Mengdan Zhang, Runnan Chen, Guanyu Cai, Penghao Zhou, Pai Peng, Xiaowei Guo, Jian Wu, Xing Sun

Keywords: Computer Vision, Language and Vision, Deep Learning

9:07

22/11/2021

Hierarchical Interaction Network for Video Object Segmentation from Referring Expressions

Zhao Yang, Yansong Tang, Luca Bertinetto, Hengshuang Zhao, Philip Torr

Keywords: segmentation, video object segmentation, referring segmentation, referring video object segmentation, video object segmentation from referring expressions, referring image segmentation, referring image comprehension, optical flow, visual grounding

2:57

03/05/2021

Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization

Juntae Lee, Mihir Jain, Hyoungwoo Park, Sungrack Yun

Keywords: Action localization, Multimodal Attention, Audio-Visual, Weak-supervision, Event localization

5:11

14/06/2020

Multi-Granularity Reference-Aided Attentive Feature Aggregation for Video-Based Person Re-Identification

Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Zhibo Chen

Keywords: multi-granularity attention, video person re-identification, attentive feature aggregation, reference-aided attention, feature relations

1:01

03/05/2021

Support-set bottlenecks for video-text representation learning

Mandela Patrick, Po-Yao Huang, Yuki Asano, Florian Metze, Alexander G Hauptmann, Joao F. Henriques, Andrea Vedaldi

Keywords: contrastive learning, video-text learning, multi-modal learning, video representation learning

6:40

05/01/2021

Improving Video Captioning With Temporal Composition of a Visual-Syntactic Embedding

Jesus Perez-Martin, Benjamin Bustos, Jorge Perez

5:01

14/06/2020

Context-Aware Attention Network for Image-Text Retrieval

Qi Zhang, Zhen Lei, Zhaoxiang Zhang, Stan Z. Li

Keywords: image-text retrieval, multimodal, attention

1:01

14/06/2020

Local-Global Video-Text Interactions for Temporal Grounding

Jonghwan Mun, Minsu Cho, Bohyung Han

Keywords: temporal grounding, temporal moment retrieval, localization by natural language, video understanding, vision and language

1:01

06/12/2021

End-to-end Multi-modal Video Temporal Grounding

Yi-Wen Chen, Yi-Hsuan Tsai, Ming-Hsuan Yang

Keywords: self-supervised learning, transformers, vision, contrastive learning

8:46

07/09/2020

Sentence Guided Temporal Modulation for Dynamic Video Thumbnail Generation

Mrigank Rochan, Mahesh Kumar Krishna Reddy, Yang Wang

Keywords: video thumbnail generation, conditional normalization

7:40

06/12/2021

Look at What I’m Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos

Reuben Tan, Bryan Plummer, Kate Saenko, Hailin Jin, Bryan Russell

Keywords: optimization

12:28

02/02/2021

Non-Autoregressive Coarse-to-Fine Video Captioning

Bang Yang, Yuexian Zou, Fenglin Liu, Can Zhang

18:21

06/12/2020

Learning Representations from Audio-Visual Spatial Alignment

Pedro Morgado, Yi Li, Nuno Vasconcelos

3:21

14/06/2020

Video Playback Rate Perception for Self-Supervised Spatio-Temporal Representation Learning

Yuan Yao, Chang Liu, Dezhao Luo, Yu Zhou, Qixiang Ye

Keywords: self-supervised spatio-temporal representation learning, multi-temporal resolution characteristic, playback rate perception, motion attention mechanism

1:01

14/06/2020

Weakly-Supervised Action Localization by Generative Attention Modeling

Baifeng Shi, Qi Dai, Yadong Mu, Jingdong Wang

Keywords: action localization, weakly-supervised, action-context confusion, vae, generative

0:58

22/09/2020

Progressive layered extraction (PLE): A novel multi-task learning (MTL) model for personalized recommendations

Hongyan Tang, Junning Liu, Ming Zhao, Xudong Gong

Keywords: Recommender System, Multi-task Learning, Seesaw Phenomenon

3:20

03/05/2021

Disentangled Recurrent Wasserstein Autoencoder

Jun Han, Martin Min, Ligong Han, Li Erran Li, Xuan Zhang

Keywords: Recurrent Generative Model, Sequential Representation Learning, Disentanglement

9:17

03/05/2021

Self-Supervised Learning of Compressed Video Representations

Youngjae Yu, Sangho Lee, Gunhee Kim, Yale Song

Keywords: self-supervised learning, Compressed videos

4:34

14/06/2020

IMRAM: Iterative Matching With Recurrent Attention Memory for Cross-Modal Image-Text Retrieval

Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, Jungong Han

Keywords: cross-modal image text retrieval, iterative matching, recurrent attention memory

1:04

14/06/2020

Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA

Ronghang Hu, Amanpreet Singh, Trevor Darrell, Marcus Rohrbach

Keywords: textvqa, visual question answering, vqa, vision and language, st-vqa, ocr-vqa, transformer, pointer network, ocr

4:56

22/11/2021

GTA: Global Temporal Attention for Video Action Understanding

Bo He, Xitong Yang, Zuxuan Wu, Hao Chen, Ser-Nam Lim, Abhinav Shrivastava

Keywords: action recognition, self-attention, temporal modeling

2:55

19/08/2021

Text-based Person Search via Multi-Granularity Embedding Learning

Chengji Wang, Zhiming Luo, Yaojin Lin, Shaozi Li

Keywords: Computer Vision, Language and Vision, Recognition

12:25

19/04/2021

Exploiting multimodal reinforcement learning for simultaneous machine translation

Julia Ive, Andy Mingren Li, Yishu Miao, Ozan Caglayan, Pranava Madhyastha, Lucia Specia

10:50

06/12/2020

Hierarchical Granularity Transfer Learning

Shaobo Min, Hongtao Xie, Hantao Yao, Xuran Deng, Zheng-Jun Zha, Yongdong Zhang

3:07

06/12/2021

Intriguing Properties of Contrastive Losses

Ting Chen, Calvin Luo, Lala Li

Keywords: self-supervised learning, vision, contrastive learning

13:36

19/08/2021

Step-Wise Hierarchical Alignment Network for Image-Text Matching

Zhong Ji, Kexin Chen, Haoran Wang

Keywords: Computer Vision, Language and Vision

6:07

04/07/2020

Multi-Granularity Interaction Network for Extractive and Abstractive Multi-Document Summarization

Hanqi Jin, Tianming Wang, Xiaojun Wan

Keywords: Extractive Summarization, Extractive, abstractive summarization, Multi-Granularity Network

10:38

02/02/2021

Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context

Ziyi Liu, Le Wang, Wei Tang, Junsong Yuan, Nanning Zheng, Gang Hua

19:49

06/12/2020

Self-Supervised MultiModal Versatile Networks

Jean-Baptiste Alayrac, Adria Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, Andrew Zisserman

3:25

06/12/2021

MERLOT: Multimodal Neural Script Knowledge Models

Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi

Keywords: representation learning

18:15

14/06/2020

Image Search With Text Feedback by Visiolinguistic Attention Learning

Yanbei Chen, Shaogang Gong, Loris Bazzani

Keywords: vision and language, image search, text feedback, attention mechanism, transformer, multimodal learning, representation learning, composition, image retrieval, interactive image search

1:00

14/06/2020

Action Modifiers: Learning From Adverbs in Instructional Videos

Hazel Doughty, Ivan Laptev, Walterio Mayol-Cuevas, Dima Damen

Keywords: vision and language, video understanding, action recognition, action retrieval, instructional videos, weakly-supervised videos, action and behaviour, attributes, attention, adverbs

1:01

05/01/2021

PDAN: Pyramid Dilated Attention Network for Action Detection

Rui Dai, Srijan Das, Luca Minciullo, Lorenzo Garattoni, Gianpiero Francesca, Francois Bremond

5:00

14/06/2020

TEA: Temporal Excitation and Aggregation for Action Recognition

Yan Li, Bin Ji, Xintian Shi, Jianguo Zhang, Bin Kang, Limin Wang

Keywords: action recognition, temporal modeling, motion encoding, temporal aggregation

1:01

06/12/2020

Learning Semantic-aware Normalization for Generative Adversarial Networks

Heliang Zheng, Jianlong Fu, Yanhong Zeng, Jiebo Luo, Zheng-Jun Zha

3:11

06/12/2020

Bidirectional Convolutional Poisson Gamma Dynamical Systems

Wenchao Chen, Chaojie Wang, Bo Chen, Yicheng Liu, Hao Zhang, Mingyuan Zhou

3:23

06/12/2021

Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing

Yan-Bo Lin, Hung-Yu Tseng, Hsin-Ying Lee, Yen-Yu Lin, Ming-Hsuan Yang

14:06

22/11/2021

Domain Attention Consistency for Multi-Source Domain Adaptation

Zhongying Deng, Kaiyang Zhou, Yongxin Yang, Tao Xiang

Keywords: Transferable Attribute Learning, Domain Attention Consistency, Multi-Source Domain Adaptation

9:24