Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

Abstract: Discriminatively localizing sounding objects in cocktail-party scenes, i.e., scenes with mixed sounds, is commonplace for humans but still challenging for machines. In this paper, we propose a two-stage learning framework for self-supervised, class-aware sounding object localization. First, we learn robust object representations by aggregating candidate sound localization results from single-source scenes. Then, class-aware object localization maps are generated in cocktail-party scenarios by referring to the pre-learned object knowledge, and the sounding objects are selected by matching the audio and visual object category distributions, with audiovisual consistency serving as the self-supervised signal. Experimental results on both realistic and synthesized cocktail-party videos demonstrate that our model is superior at filtering out silent objects and pointing out the locations of sounding objects of different classes. Code is available at https://github.com/DTaoo/Discriminative-Sounding-Objects-Localization.
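
The matching step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of a soft sounding-area mask, average pooling, softmax, and KL divergence as the consistency measure are all assumptions chosen for illustration. The idea shown is only that per-class localization maps are pooled into a visual category distribution, which is then compared against an audio category distribution.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def visual_category_distribution(class_maps, sounding_mask):
    # class_maps: K maps of shape H x W (nested lists), one per object class.
    # sounding_mask: H x W map in [0, 1] marking likely sounding regions.
    # Both inputs are hypothetical stand-ins for the model's outputs.
    scores = []
    for cmap in class_maps:
        weighted = [c * m
                    for crow, mrow in zip(cmap, sounding_mask)
                    for c, m in zip(crow, mrow)]
        scores.append(sum(weighted) / len(weighted))  # masked average pooling
    return softmax(scores)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q); assumed here as the audiovisual consistency measure.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Two object classes on a 2 x 2 grid; only the top-left location is sounding.
class_maps = [[[1.0, 0.0], [0.0, 0.0]],   # class 0 sits at the top-left
              [[0.0, 0.0], [0.0, 1.0]]]   # class 1 sits at the bottom-right
sounding_mask = [[1.0, 0.0], [0.0, 0.0]]
visual_dist = visual_category_distribution(class_maps, sounding_mask)
audio_dist = [0.9, 0.1]                   # hypothetical audio classifier output
loss = kl_divergence(audio_dist, visual_dist)
```

In this toy setup the silent object (class 1) contributes nothing after masking, so the visual distribution leans toward class 0 and agrees better with an audio distribution that also favors class 0.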

02/02/2021

self-supervised learning, unsupervised representation learning, data augmentation, MixUp, contrastive representation learning

5:04

05/01/2021

Martin Sundermeyer, Maximilian Durner, En Yen Puang, Zoltan-Csaba Marton, Narunas Vaskevicius, Kai O. Arras, Rudolph Triebel


object pose estimation, encodings, multi object, synthetic data, symmetries, autoencoder, embedding, 6d object detection, t-less, relative pose estimation

1:01

14/06/2020

state representation learning, graph neural networks, model-based reinforcement learning, relational learning, object discovery

14:51

19/08/2021

visual saliency, salient object detection, rgb-d, depth information, joint learning, dense connections, multi-modal features, feature fusion, deep learning, encoder-decoder

1:01

02/02/2021

Train a One-Million-Way Instance Classifier for Unsupervised Visual Representation Learning

Yu Liu, Lianghua Huang, Pan Pan, Bin Wang, Yinghui Xu, Rong Jin

transformer, image captioning, vision and language, fully-attentive models, mesh connectivity, memory vectors, self-attention

1:00

14/06/2020

stereo matching, wavelet coefficients, inverse wavelet transform, supervised learning, deep representation, multi-scale features, multi-resolution cost volume, wavelet regression, disparity reconstruction, disparity refinement

1:01

03/05/2021

saliency detection, salient object detection, feature interaction strategy, scale-insensitive loss, multi-scale features, multi-level features, fully convolutional network, deep learning

1:01

06/12/2020

Mengde Xu, Zheng Zhang, Fangyun Wei, Yutong Lin, Yue Cao, Stephen Lin, Han Hu, Xiang Bai

subspace, few-shot, meta-learning, classification

1:01

06/12/2021