Contrastive Learning of Global and Local Video Representations

06/12/2021

Contrastive Learning of Global and Local Video Representations

Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song

Keywords: machine learning, self-supervised learning, contrastive learning, representation learning

Abstract Paper Similar Papers

Abstract: Contrastive learning has delivered impressive results for various tasks in the self-supervised regime. However, existing approaches optimize for learning representations specific to downstream scenarios, i.e., global representations suitable for tasks such as classification or local representations for tasks such as detection and localization. While they produce satisfactory results in the intended downstream scenarios, they often fail to generalize to tasks that they were not originally designed for. In this work, we propose to learn video representations that generalize to both the tasks which require global semantic information (e.g., classification) and the tasks that require local fine-grained spatio-temporal information (e.g., localization). We achieve this by optimizing two contrastive objectives that together encourage our model to learn global-local visual information given audio signals. We show that the two objectives mutually improve the generalizability of the learned global-local representations, significantly outperforming their disjointly learned counterparts. We demonstrate our approach on various tasks including action/sound classification, lip reading, deepfake detection, event and sound localization.

0

0

0

0

Share

This is an embedded video. Talk and the respective paper are published at NeurIPS 2021 virtual conference. If you are one of the authors of the paper and want to manage your upload, see the question "My papertalk has been externally embedded..." in the FAQ section.

Comments

Post Comment

no comments yet

Similar Papers

06/12/2020

Learning Representations from Audio-Visual Spatial Alignment

Pedro Morgado, Yi Li, Nuno Nvasconcelos

Keywords Paper

0

0

0

0

3:21

06/12/2020

Self-Supervised Learning by Cross-Modal Audio-Video Clustering

Humam Alwassel, Dhruv Mahajan, Bruno Korbar and
Lorenzo Torresani, Bernard Ghanem, Du Tran

Keywords Paper

, Applications -> Computer Vision

0

0

0

0

3:17

06/12/2021

Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing

Yan-Bo Lin, Hung-Yu Tseng, Hsin-Ying Lee and
Yen-Yu Lin, Ming-Hsuan Yang

Keywords Paper

0

0

0

0

14:06

03/05/2021

Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization

Juntae Lee, Mihir Jain, Hyoungwoo Park, Sungrack Yun

Keywords Paper

Action localization, Multimodal Attention, Audio-Visual, Weak-supervision, Event localization

0

0

0

0

5:11

06/12/2021

Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations

Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, Luc V Gool

Keywords Paper

self-supervised learning, vision, contrastive learning, representation learning

0

0

0

0

13:32

06/12/2021

Contrast and Mix: Temporal Contrastive Video Domain Adaptation with Background Mixing

Aadarsh Sahoo, Rutav Shah, Rameswar Panda and
Kate Saenko, Abir Das

Keywords Paper

domain adaptation, contrastive learning

0

0

0

0

13:20

03/05/2021

Support-set bottlenecks for video-text representation learning

Mandela Patrick, Po-Yao Huang, Yuki Asano and
Florian Metze, Alexander G Hauptmann, Joao F. Henriques, Andrea Vedaldi

Keywords Paper

contrastive learning, video-text learning, multi-modal learning, video representation learning

0

0

0

0

6:40

06/12/2021

CCVS: Context-aware Controllable Video Synthesis

Guillaume Le Moing, Jean Ponce, Cordelia Schmid

Keywords Paper

adversarial robustness and security, self-supervised learning, transformers, generative model

0

0

0

0

11:59

14/06/2020

Telling Left From Right: Learning Spatial Correspondence of Sight and Sound

Karren Yang, Bryan Russell, Justin Salamon

Keywords Paper

audio-visual learning in video, self-supervision, video dataset, spatial audio, localization, spatialization, upmixing, source separation

0

0

0

0

4:41

02/02/2021

Binaural Audio-Visual Localization

Xinyi Wu, Zhenyao Wu, Lili Ju, Song Wang

Keywords Paper

0

0

0

0

13:42

06/12/2021

Compressive Visual Representations

Kuang-Huei Lee, Anurag Arnab, Sergio Guadarrama and
John Canny, Ian Fischer

Keywords Paper

theory, machine learning, robustness, self-supervised learning, contrastive learning

0

0

0

0

6:30

02/02/2021

Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context

Ziyi Liu, Le Wang, Wei Tang and
Junsong Yuan, Nanning Zheng, Gang Hua

Keywords Paper

0

0

0

0

19:49

06/12/2020

Self-supervised Co-Training for Video Representation Learning

Tengda Han, Weidi Xie, Andrew Zisserman

Keywords Paper

0

0

0

0

3:08

03/05/2021

Disentangled Recurrent Wasserstein Autoencoder

Jun Han, Martin Min, Ligong Han and
Li Erran Li, Xuan Zhang

Keywords Paper

Recurrent Generative Model, Sequential Representation Learning, Disentanglement

0

0

0

0

9:17

14/06/2020

Self-Supervised Monocular Trained Depth Estimation Using Self-Attention and Discrete Disparity Volume

Adrian Johnston, Gustavo Carneiro

Keywords Paper

self-supervised depth estimation, self-supervised learning, self-attention, depth estimation, uncertainty

0

0

0

0

1:01

19/08/2021

Multi-Scale Selective Feedback Network with Dual Loss for Real Image Denoising

Xiaowan Hu, Yuanhao Cai, Zhihong Liu and
Haoqian Wang, Yulun Zhang

Keywords Paper

Computer Vision, Computational Photography, Photometry, Shape from X, Deep Learning

0

0

0

0

9:52

22/11/2021

Fine-grained Multi-Modal Self-Supervised Learning

Duo Wang, Salah Karout

Keywords Paper

self-supervised learning, multi-modal learning

0

0

0

0

2:46

26/04/2020

Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification

Yixiao Ge, Dapeng Chen, Hongsheng Li

Keywords Paper

Label Refinery, Unsupervised Domain Adaptation, Person Re-identification

0

0

0

0

5:03

06/12/2020

Learning Semantic-aware Normalization for Generative Adversarial Networks

Heliang Zheng, Jianlong Fu, zengyh Zeng and
Jiebo Luo, Zheng-Jun Zha

Keywords Paper

0

0

0

0

3:11

08/12/2020

Multi-task Learning of Spoken Language Understanding by Integrating N-Best Hypotheses with Hierarchical Attention

Mingda Li, Xinyue Liu, Weitong Ruan and
Luca Soldaini, Wael Hamza, Chengwei Su

Keywords Paper

0

0

0

0

14:43

03/05/2021

Meta-GMVAE: Mixture of Gaussian VAE for Unsupervised Meta-Learning

Dong Bok Lee, Dongchan Min, Seanie Lee, Sung Ju Hwang

Keywords Paper

Unsupervised Learning, Variational Autoencoders, Unsupervised Meta-learning, Meta-Learning

0

0

0

0

13:31

06/12/2021

Look at What I’m Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos

Reuben Tan, Bryan Plummer, Kate Saenko and
Hailin Jin, Bryan Russell

Keywords Paper

optimization

0

0

0

0

12:28

18/07/2021

Robust Representation Learning via Perceptual Similarity Metrics

Saeid A Taghanaki, Kristy Choi, Amir Hosein Khasahmadi, Anirudh Goyal

Keywords Paper

Deep Learning, Embedding and Representation learning

0

0

0

0

4:48

02/02/2021

Self-supervised Pre-training and Contrastive Representation Learning for Multiple-choice Video QA

Seonhoon Kim, Seohyeong Jeong, Eunbyul Kim and
Inho Kang, Nojun Kwak

Keywords Paper

0

0

0

0

15:23

14/06/2020

Action Modifiers: Learning From Adverbs in Instructional Videos

Hazel Doughty, Ivan Laptev, Walterio Mayol-Cuevas, Dima Damen

Keywords Paper

vision and language, video understanding, action recognition, action retrieval, instructional videos, weakly-supervised videos, action and behaviour, attributes, attention, adverbs

0

0

0

0

1:01

13/04/2021

CLAR: Contrastive learning of auditory representations

Haider Al-Tahan, Yalda Mohsenzadeh

Keywords Paper

0

0

0

0

3:34

05/01/2021

Data-Efficient Alignment of Multimodal Sequences by Aligning Gradient Updates and Internal Feature Distributions

Jianan Wang, Boyang Li, Xiangyu Fan and
Jing Lin, Yanwei Fu

Keywords Paper

0

0

0

0

4:49

06/12/2020

Do Adversarially Robust ImageNet Models Transfer Better?

Hadi Salman, Andrew Ilyas, Logan Engstrom and
Ashish Kapoor, Aleksander Madry

Keywords Paper

0

0

0

0

4:16

06/12/2021

Improving Transferability of Representations via Augmentation-Aware Self-Supervision

Hankook Lee, Kibok Lee, Kimin Lee and
Honglak Lee, Jinwoo Shin

Keywords Paper

self-supervised learning, contrastive learning, representation learning, transfer learning

0

0

0

0

10:10

02/02/2021

Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation

Yan-Bo Lin, Yu-Chiang Frank Wang

Keywords Paper

0

0

0

0

15:06

14/06/2020

Learning Memory-Guided Normality for Anomaly Detection

Hyunjong Park, Jongyoun Noh, Bumsub Ham

Keywords Paper

anomaly detection, unsupervised learning, memory networks, prototypical feature, pattern clustering, feature extraction, video recognition, convolutional neural networks

0

0

0

0

1:01

06/12/2021

When does Contrastive Learning Preserve Adversarial Robustness from Pretraining to Finetuning?

Lijie Fan, Sijia Liu, Pin-Yu Chen and
Gaoyuan Zhang, Chuang Gan

Keywords Paper

machine learning, robustness, adversarial robustness and security, self-supervised learning, vision, contrastive learning, clustering

0

0

0

0

7:33

14/06/2020

ActBERT: Learning Global-Local Video-Text Representations

Linchao Zhu, Yi Yang

Keywords Paper

actbert, cross-modal pretraining, video and language, transformer, tangled transformer, instructional videos

0

0

0

0

4:58

02/02/2021

SeCo: Exploring Sequence Supervision for Unsupervised Representation Learning

Ting Yao, Yiheng Zhang, Zhaofan Qiu and
Yingwei Pan, Tao Mei

Keywords Paper

0

0

0

0

16:17

07/09/2020

Attentive Action and Context Factorization

Yang Wang, Vinh Tran, Gedas Bertasius and
Lorenzo Torresani, Minh Hoai Nguyen

Keywords Paper

action factorization, attention, conjugate samples

0

0

0

0

9:59

02/02/2021

Semantic Grouping Network for Video Captioning

Hobin Ryu, Sunghun Kang, Haeyong Kang, Chang D. Yoo

Keywords Paper

0

0

0

0

17:41

14/06/2020

Multi-Mutual Consistency Induced Transfer Subspace Learning for Human Motion Segmentation

Tao Zhou, Huazhu Fu, Chen Gong and
Jianbing Shen, Ling Shao, Fatih Porikli

Keywords Paper

human motion segmentation, transfer subspace learning, multi-level features, multi-mutual consistency learning.

0

0

0

0

1:00

02/02/2021

Spatial-temporal Causal Inference for Partial Image-to-video Adaptation

Jin Chen, Xinxiao Wu, Yao Hu, Jiebo Luo

Keywords Paper

0

0

0

0

20:01

03/05/2021

Improving Transformation Invariance in Contrastive Representation Learning

Adam Foster, Rattana Pukdee, Tom Rainforth

Keywords Paper

transformation invariance, contrastive learning, representation learning

0

0

0

0

5:23

06/12/2021

TriBERT: Human-centric Audio-visual Representation Learning

Tanzila Rahman, Mengyu Yang, Leonid Sigal

Keywords Paper

transformers, representation learning

0

0

0

0

13:54