Taming Visually Guided Sound Generation

22/11/2021

Vladimir Iashin, Esa Rahtu

Keywords: multi-modal learning, audio generation, video understanding, transformer, VQVAE, MelGAN, perceptual loss, generation metrics, VGGSound, VAS

Abstract: Recent advances in visually-induced audio generation are based on sampling short, low-fidelity, and one-class sounds. Moreover, sampling 1 second of audio from the state-of-the-art model takes minutes on a high-end GPU. In this work, we propose a single model capable of generating visually relevant, high-fidelity sounds prompted with a set of frames from open-domain videos, on a single GPU, in less time than it takes to play them back. We train a transformer to sample a new spectrogram from the pre-trained spectrogram codebook given the set of video features. The codebook is obtained using a variant of VQGAN trained to produce a compact sampling space with a novel spectrogram-based perceptual loss. The generated spectrogram is transformed into a waveform using a window-based GAN that significantly speeds up generation. Considering the lack of metrics for automatic evaluation of generated spectrograms, we also build a family of metrics called FID and MKL. These metrics are based on a novel sound classifier, called Melception, and designed to evaluate the fidelity and relevance of open-domain samples. Both qualitative and quantitative studies are conducted on small- and large-scale datasets to evaluate the fidelity and relevance of generated samples. We also compare our model to the state-of-the-art and observe a substantial improvement in quality, size, and computation time. Code, demo, and samples: v-iashin.github.io/SpecVQGAN
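
The abstract describes sampling spectrogram tokens from a learned codebook. As a rough illustration of what a codebook lookup involves, the minimal Python sketch below maps feature vectors to their nearest codebook entries. It is a generic vector-quantization example, not the authors' implementation: the function names, the toy 2-D codebook, and the feature values are all hypothetical.

```python
# Minimal sketch of vector quantization (nearest-codebook lookup).
# Every name and number here is hypothetical and not taken from SpecVQGAN.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features, codebook):
    """Return, for each feature vector, the index of its nearest codebook entry."""
    indices = []
    for vec in features:
        distances = [squared_distance(vec, code) for code in codebook]
        indices.append(distances.index(min(distances)))
    return indices


def decode(indices, codebook):
    """Replace each index with the codebook vector it points to."""
    return [codebook[i] for i in indices]


# Toy codebook with three 2-D entries (a learned codebook would be far larger).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

# Toy "encoder outputs" that each lie close to one codebook entry.
features = [[0.1, -0.2], [0.9, 1.2], [0.2, 1.9]]

indices = quantize(features, codebook)
quantized = decode(indices, codebook)
print(indices)
```

In a VQGAN-style model the codebook would be learned jointly with an encoder and decoder, and at generation time a transformer would predict the index sequence instead of reading it off encoder outputs.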

The talk and the corresponding paper were published at the BMVC 2021 virtual conference.

Similar Papers

06/12/2021

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong

Keywords: machine learning, self-supervised learning, transformers, vision, contrastive learning

15:59

26/04/2020

High Fidelity Speech Synthesis with Adversarial Networks

Mikołaj Bińkowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C. Cobo, Karen Simonyan

Keywords: text-to-speech, speech synthesis, audio synthesis, GANs, generative adversarial networks, implicit generative models

15:07

02/11/2020

Self-supervised classification for detecting anomalous sounds

Ritwik Giri, Srikanth V. Tenneti, Fangzhou Cheng, Karim Helwani, Umut Isik, Arvindh Krishnaswamy

13:28

03/05/2021

Learning Energy-Based Models by Diffusion Recovery Likelihood

Ruiqi Gao, Yang Song, Ben Poole, Yingnian Wu, Durk Kingma

Keywords: recovery likelihood, EBM, energy-based model, generative model, HMC, Langevin dynamics, MCMC, diffusion process

6:03

18/07/2021

On Perceptual Lossy Compression: The Cost of Perceptual Reconstruction and An Optimal Training Framework

Zeyu Yan, Fei Wen, Rendong Ying, Chao Ma, Peilin Liu

Keywords: Theory, Information Theory

5:08

06/12/2020

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Jungil Kong, Jaehyeon Kim, Jaekyoung Bae

2:54

06/12/2020

Your GAN is Secretly an Energy-based Model and You Should Use Discriminator Driven Latent Sampling

Tong Che, Ruixiang Zhang, Jascha Sohl-Dickstein, Hugo Larochelle, Liam Paull, Yuan Cao, Yoshua Bengio

3:19

18/07/2021

Is Space-Time Attention All You Need for Video Understanding?

Gedas Bertasius, Heng Wang, Lorenzo Torresani

Keywords: Algorithms, AutoML, Deep Learning, Architectures

5:15

22/11/2021

Audio-Visual Synchronisation in the wild

Triantafyllos Afouras, Honglie Chen, Weidi Xie, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman

Keywords: multimodal learning, self supervision, audio-visual synchronisation, dataset

3:02

06/12/2021

Data-Efficient Instance Generation from Instance Discrimination

Ceyuan Yang, Yujun Shen, Yinghao Xu, Bolei Zhou

Keywords: machine learning, generative model

6:53

03/05/2021

Overfitting for Fun and Profit: Instance-Adaptive Data Compression

Ties van Rozendaal, Iris Huijben, Taco Cohen

Keywords: Neural data compression, Learned compression, Generative modeling, Finetuning, Overfitting, Instance learning, Instance adaptation, Variational autoencoders, Rate-distortion optimization, Model compression, Weight quantization

4:59

02/02/2021

TaLNet: Voice Reconstruction from Tongue and Lip Articulation with Transfer Learning from Text-to-Speech Synthesis

Jing-Xuan Zhang, Korin Richmond, Zhen-Hua Ling, Lirong Dai

19:58

18/07/2021

SoundDet: Polyphonic Moving Sound Event Detection and Localization from Raw Waveform

Yuhang He, Niki Trigoni, Andrew Markham

Keywords: Applications, Audio and Speech Processing

4:34

03/05/2021

Anytime Sampling for Autoregressive Models via Ordered Autoencoding

Yilun Xu, Yang Song, Sahaj Garg, Linyuan Gong, Rui Shu, Aditya Grover, Stefano Ermon

6:27

06/12/2020

Hierarchical Patch VAE-GAN: Generating Diverse Videos from a Single Sample

Shir Gur, Sagie Benaim, Lior Wolf

3:20

03/05/2021

Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds

Efthymios Tzinis, Scott Wisdom, Aren Jansen, Shawn Hershey, Tal Remez, Dan Ellis, John Hershey

Keywords: self-supervised learning, universal sound separation, in-the-wild data, Audio-visual sound separation, unsupervised learning

5:06

06/12/2020

Make One-Shot Video Object Segmentation Efficient Again

Tim Meinhardt, Laura Leal-Taixé

3:17

14/06/2020

PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models

Sachit Menon, Alexandru Damian, Shijia Hu, Nikhil Ravi, Cynthia Rudin

Keywords: perceptual super-resolution, face hallucination, upsampling, generative models, gan, self-supervision

1:01

22/11/2021

One Model to Reconstruct Them All: A Novel Way to Use the Stochastic Noise in StyleGAN

Christian Bartz, Joseph Bethge, Haojin Yang, Christoph Meinel

Keywords: stylegan, generative adversarial networks, stochastic noise, encoder, reconstruction, denoising

3:04

14/06/2020

Supervised Raw Video Denoising With a Benchmark Dataset on Dynamic Scenes

Huanjing Yue, Cong Cao, Lei Liao, Ronghe Chu, Jingyu Yang

Keywords: raw video denoising, isp, cnns, deep learning, computer vision

1:02

02/02/2021

Binaural Audio-Visual Localization

Xinyi Wu, Zhenyao Wu, Lili Ju, Song Wang

13:42

14/06/2020

Regularization on Spatio-Temporally Smoothed Feature for Action Recognition

Jinhyung Kim, Seunghwan Cha, Dongyoon Wee, Soonmin Bae, Junmo Kim

Keywords: regularization, action recognition, video classification

1:01

22/11/2021

Faster-FCoViAR: Faster Frequency-Domain Compressed Video Action Recognition

Lu Xiong, Xia Jia, Yue Ming, Jiangwan Zhou, Fan Feng, Nan nan Hu

Keywords: action recognition, frequency-domain, compressed videos, teacher-student network

3:00

22/11/2021

Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video

Rishabh Garg, Ruohan Gao, Kristen Grauman

Keywords: Binaural Audio, Audio visual learning

9:48

22/11/2021

Audio-Visual Speech Super-Resolution

Rudrabha Mukhopadhyay, Sindhu B Hegde, Vinay Namboodiri, C.V. Jawahar

Keywords: speech super-resolution, audio-visual data, audio-visual learning, pseudo-visual stream, multi-modal learning

10:01

03/05/2021

Neural Synthesis of Binaural Speech From Mono Audio

Alexander Richard, Dejan Markovic, Israel Gebru, Steven Krenn, Gladstone A Butler, Fernando Torre, Yaser Sheikh

Keywords: speech generation, speech processing, binaural speech, neural sound synthesis, sound spatialization, binaural audio

15:00

14/06/2020

Scale-Space Flow for End-to-End Optimized Video Compression

Eirikur Agustsson, David Minnen, Nick Johnston, Johannes Ballé, Sung Jin Hwang, George Toderici

Keywords: learned video compression, scale-space flow, bilinear warping

0:55

02/02/2021

Symbolic Music Generation with Transformer-GANs

Aashiq Muhamed, Liang Li, Xingjian Shi, Suri Yaddanapudi, Wayne Chi, Dylan Jackson, Rahul Suresh, Zachary C. Lipton, Alex J. Smola

20:01

14/06/2020

Discriminative Multi-Modality Speech Recognition

Bo Xu, Cheng Lu, Yandong Guo, Jacob Wang

Keywords: multi-modal, audio-visual, speech recognition, lip reading, deep learning, eleatt-gru

1:01

12/07/2020

Non-Autoregressive Neural Text-to-Speech

Kainan Peng, Wei Ping, Zhao Song, Kexin Zhao

Keywords: Applications - Language, Speech and Dialog

15:12

14/06/2020

Learning Rank-1 Diffractive Optics for Single-Shot High Dynamic Range Imaging

Qilin Sun, Ethan Tseng, Qiang Fu, Wolfgang Heidrich, Felix Heide

Keywords: single-shot hdr, high-dynamic-range imaging, computational photography, computational optics, end-to-end camera optimization, automotive imaging

4:58

02/02/2021

SSD-GAN: Measuring the Realness in the Spatial and Spectral Domains

Yuanqi Chen, Ge Li, Cece Jin, Shan Liu, Thomas Li

14:27

06/12/2021

Improved Transformer for High-Resolution GANs

Long Zhao, Zizhao Zhang, Ting Chen, Dimitris Metaxas, Han Zhang

Keywords: transformers, generative model

12:11

06/12/2020

Differentiable Augmentation for Data-Efficient GAN Training

Shengyu Zhao, Zhijian Liu, Ji Lin, Jun-Yan Zhu, Song Han

3:22

03/05/2021

WaveGrad: Estimating Gradients for Waveform Generation

Nanxin Chen, Yu Zhang, Heiga Zen, Ron Weiss, Mohammad Norouzi, William Chan

Keywords: gradient estimation, waveform generation, score matching, vocoder, diffusion, text-to-speech

5:09

14/06/2020

A2dele: Adaptive and Attentive Depth Distiller for Efficient RGB-D Salient Object Detection

Yongri Piao, Zhengkun Rong, Miao Zhang, Weisong Ren, Huchuan Lu

Keywords: rgb-d, salient object detection, knowledge distillation, attention, computer vision, cnn

1:00

17/08/2020

High resolution étendue expansion for holographic displays

Grace Kuo, Laura Waller, Ren Ng, Andrew Maimone

Keywords: near-eye displays, computer generated holography, computational displays, augmented reality

2:36

02/11/2020

DCASE 2020 Task2: Anomalous sound detection using relevant spectral feature and focusing techniques in the unsupervised learning scenario

Jihwan Park, Sooyeon Yoo

11:06

02/11/2020

On multitask loss function for audio event detection and localization

Huy Phan, Lam Pham, Philipp Koch, Ngoc Q. K. Duong, Ian McLoughlin, Alfred Mertins

15:16

06/12/2020

Compressing Images by Encoding Their Latent Representations with Relative Entropy Coding

Greg Flamich, Marton Havasi, Jose Miguel Hernández-Lobato

3:37