S-VVAD: Visual Voice Activity Detection by Motion Segmentation

05/01/2021

Muhammad Shahid, Cigdem Beyan, Vittorio Murino

Abstract: We address the challenging Voice Activity Detection (VAD) problem, which determines "Who is Speaking and When?" in audiovisual recordings. Typical audio-based VAD systems can be ineffective in the presence of ambient noise or noise variations. Moreover, for technical or privacy reasons, audio might not always be available. In such cases, using the video modality to perform VAD is desirable. Almost all existing visual VAD methods rely on body part detection, e.g., face, lips, or hands. In contrast, we propose a novel visual VAD method that operates directly on the entire video frame, without explicitly detecting a person or their body parts. Our method, named S-VVAD, learns body motion cues associated with speech activity within a weakly supervised segmentation framework. It therefore not only detects speakers and non-speakers but also simultaneously localizes them in the image. It is an end-to-end, person-independent pipeline that requires neither prior knowledge nor pre-processing. S-VVAD performs well in various challenging conditions and achieves state-of-the-art results on multiple datasets. Moreover, its better generalization capability is confirmed in cross-dataset and person-independent scenarios.
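To illustrate how a segmentation output can yield both detection and localization, here is a minimal sketch. It is not the authors' implementation, and the abstract does not specify the output format. The label encoding (`BACKGROUND`, `SPEAKING`, `NOT_SPEAKING`), the function name `regions_from_mask`, and the toy mask are all assumptions made for illustration. The sketch groups same-label pixels into connected regions and reports a class and a bounding box for each.

```python
from collections import deque

# Hypothetical label encoding for this sketch (not taken from the paper).
BACKGROUND, SPEAKING, NOT_SPEAKING = 0, 1, 2


def regions_from_mask(mask):
    """Group 4-connected pixels that share a non-background label.

    Returns a list of (label, (top, left, bottom, right)) tuples,
    one per connected region, in row-major order of discovery.
    """
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    regions = []
    for r in range(height):
        for c in range(width):
            label = mask[r][c]
            if label == BACKGROUND or seen[r][c]:
                continue
            # Breadth-first flood fill over pixels with the same label.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and mask[ny][nx] == label):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append((label, (top, left, bottom, right)))
    return regions


# Toy 4x6 "segmentation map": one speaking region, one not-speaking region.
toy_mask = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
]
regions = regions_from_mask(toy_mask)
for label, box in regions:
    name = "speaking" if label == SPEAKING else "not speaking"
    print(name, box)
```

In this toy case each connected region stands in for one person. A real system would likely need additional handling for touching or fragmented regions, which this sketch does not attempt.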

The talk and the respective paper were published at the WACV 2021 virtual conference.

Similar Papers

06/12/2021

NORESQA: A Framework for Speech Quality Assessment using Non-Matching References

Pranay Manocha, Buye Xu, Anurag Kumar

Keywords: deep learning, robustness, self-supervised learning

14:30

19/08/2021

Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion

Suzhen Wang, Lincheng Li, Yu Ding, Changjie Fan, Xin Yu

Keywords: Computer Vision, Language and Vision, Motion and Tracking, Structural and Model-Based Approaches, Knowledge Representation and Reasoning

8:31

22/11/2021

Talking Head Generation with Audio and Speech Related Facial Action Units

Sen Chen, Zhilei Liu, Jiaxing Liu, Zhengxiang Yan, Longbiao Wang

Keywords: Talking Face Generation, Facial Action Unit, Generative Adversarial Network, Video Synthesis, Face Manipulation

2:41

06/12/2020

Listening to Sounds of Silence for Speech Denoising

Henry Xu, Rundi Wu, Yuko Ishiwaka, Carl Vondrick, Changxi Zheng

3:22

03/05/2021

Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds

Efthymios Tzinis, Scott Wisdom, Aren Jansen, Shawn Hershey, Tal Remez, Dan Ellis, John Hershey

Keywords: self-supervised learning, universal sound separation, in-the-wild data, Audio-visual sound separation, unsupervised learning

5:06

14/09/2020

MMCNN: A Multi-branch Multi-scale Convolutional Neural Network for Motor Imagery Classification

Ziyu Jia, Youfang Lin, Jing Wang, Kaixin Yang, Tianhang Liu, Xinwang Zhang

Keywords: motor imagery, convolutional neural network, eeg signal, brain–computer interface

12:20

02/02/2021

Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation

Yan-Bo Lin, Yu-Chiang Frank Wang

15:06

30/11/2020

Do We Need Sound for Sound Source Localization?

Takashi Oya, Shohei Iwase, Ryota Natsume, Takahiro Itazuri, Shugo Yamaguchi, Shigeo Morishima

8:43

06/12/2021

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong

Keywords: machine learning, self-supervised learning, transformers, vision, contrastive learning

15:59

12/08/2020

Preech: A System for Privacy-Preserving Speech Transcription

Shimaa Ahmed, Amrita Roy Chowdhury, Kassem Fawaz, Parmesh Ramanathan

12:02

22/11/2021

Audio-Visual Speech Super-Resolution

Rudrabha Mukhopadhyay, Sindhu B Hegde, Vinay Namboodiri, C.V. Jawahar

Keywords: speech super-resolution, audio-visual data, audio-visual learning, pseudo-visual stream, multi-modal learning

10:01

22/11/2021

Render In-between: Motion Guided Video Synthesis for Action Interpolation

Hsuan-I Ho, Xu Chen, Jie Song, Otmar Hilliges

Keywords: video interpolation, action prediction, human motion modeling, human generation, human centric video, neural renderer, transformer

3:04

02/11/2020

Anomalous sound detection as a simple binary classification problem with careful selection of proxy outlier examples

Paul Primus, Verena Haunschmid, Patrick Praher, Gerhard Widmer

15:23

02/02/2021

Binaural Audio-Visual Localization

Xinyi Wu, Zhenyao Wu, Lili Ju, Song Wang

13:42

05/01/2021

Vid2Int: Detecting Implicit Intention From Long Dialog Videos

Xiaoli Xu, Yao Lu, Zhiwu Lu, Tao Xiang

4:27

03/05/2021

Improving Zero-Shot Voice Style Transfer via Disentangled Representation Learning

Siyang Yuan, Pengyu Cheng, Ruiyi Zhang, Weituo Hao, Zhe Gan, Lawrence Carin

Keywords: Disentanglement, Mutual Information, Zero-shot Learning, Style Transfer

5:03

26/04/2020

High Fidelity Speech Synthesis with Adversarial Networks

Mikołaj Bińkowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C. Cobo, Karen Simonyan

Keywords: text-to-speech, speech synthesis, audio synthesis, GANs, generative adversarial networks, implicit generative models

15:07

25/04/2020

Soundr: Head Position and Orientation Prediction Using a Microphone Array

Jackie Yang, Gaurab Banerjee, Vishesh Gupta, Monica Lam, James Landay

Keywords: smart speakers, internet of things, machine learning, acoustic source localization

13:52

02/02/2021

Revisiting Mahalanobis Distance for Transformer-Based Out-of-Domain Detection

Alexander Podolskiy, Dmitry Lipin, Andrey Bout, Ekaterina Artemova, Irina Piontkovskaya

16:08

26/04/2020

From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face from Speech

Hyeong-Seok Choi, Changdae Park, Kyogu Lee

Keywords: Multi-modal learning, Self-supervised learning, Voice profiling, Conditional GANs

5:15

18/07/2021

Learning de-identified representations of prosody from raw audio

Jack Weston, Raphael Lenain, Udeepa Meepegama, Emil Fristed

Keywords: Applications, Audio and Speech Processing

4:37

02/02/2021

Listen, Understand and Translate: Triple Supervision Decouples End-to-end Speech-to-text Translation

Qianqian Dong, Rong Ye, Mingxuan Wang, Hao Zhou, Shuang Xu, Bo Xu, Lei Li

14:09

26/04/2020

DDSP: Differentiable Digital Signal Processing

Jesse Engel, Lamtharn (Hanoi) Hantrakul, Chenjie Gu, Adam Roberts

Keywords: dsp, audio, music, nsynth, wavenet, wavernn, vocoder, synthesizer, sound, signal, processing, tensorflow, autoencoder, disentanglement

5:11

06/12/2020

A Spectral Energy Distance for Parallel Speech Synthesis

Alexey Gritsenko, Tim Salimans, Rianne van den Berg, Jasper Snoek, Nal Kalchbrenner

3:11

05/01/2021

Visual Speech Enhancement Without a Real Visual Stream

Sindhu B. Hegde, K.R. Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C.V. Jawahar

5:01

18/07/2021

Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation

Dongchan Min, Dong Bok Lee, Eunho Yang, Sung Ju Hwang

Keywords: Applications, Audio and Speech Processing

5:17

25/07/2020

Auto-annotation for voice-enabled entertainment systems

Wenyan Li, Ferhan Ture

Keywords: unsupervised, voice-enabled entertainment systems, automatic speech recognition, error detection and evaluation, auto-annotation

8:06

22/11/2021

A cappella: Audio-visual Singing Voice Separation

Juan Felipe Montesinos, Venkatesh Shenoy Kadandale, Gloria Haro

Keywords: audiovisual, audio-visual, source separation, singing, speech, graph, acappella

2:51

02/02/2021

Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning

Elad Amrani, Rami Ben-Ari, Daniel Rotman, Alex Bronstein

14:04

02/02/2021

Multi-SpectroGAN: High-Diversity and High-Fidelity Spectrogram Generation with Adversarial Style Combination for Speech Synthesis

Sang-Hoon Lee, Hyun-Wook Yoon, Hyeong-Rae Noh, Ji-Hoon Kim, Seong-Whan Lee

14:19

18/07/2021

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, Tom Duerig

Keywords: Deep Learning, Embedding and Representation learning

21:03

03/05/2021

GAN2GAN: Generative Noise Learning for Blind Denoising with Single Noisy Images

Sungmin Cha, Taeeon Park, Byeongjoon Kim, Jongduk Baek, Taesup Moon

Keywords: generative learning, iterative training, blind denoising, unsupervised learning

5:37

16/11/2020

Integrating Egocentric Localization for More Realistic Point-Goal Navigation Agents

Samyak Datta, Oleksandr Maksymets, Judy Hoffman, Stefan Lee, Dhruv Batra (Georgia Tech & Facebook AI Research), Devi Parikh (Georgia Tech & Facebook AI Research)

5:08

06/12/2021

PortaSpeech: Portable and High-Quality Generative Text-to-Speech

Yi Ren, Jinglin Liu, Zhou Zhao

Keywords: generative model

10:15

22/11/2021

Taming Visually Guided Sound Generation

Vladimir Iashin, Esa Rahtu

Keywords: multi-modal learning, audio generation, video understanding, transformer, VQVAE, MelGAN, perceptual loss, generation metrics, VGGSound, VAS

9:54

06/12/2021

TriBERT: Human-centric Audio-visual Representation Learning

Tanzila Rahman, Mengyu Yang, Leonid Sigal

Keywords: transformers, representation learning

13:54

02/02/2021

TaLNet: Voice Reconstruction from Tongue and Lip Articulation with Transfer Learning from Text-to-Speech Synthesis

Jing-Xuan Zhang, Korin Richmond, Zhen-Hua Ling, Lirong Dai

19:58

14/06/2020

Discriminative Multi-Modality Speech Recognition

Bo Xu, Cheng Lu, Yandong Guo, Jacob Wang

Keywords: multi-modal, audio-visual, speech recognition, lip reading, deep learning, eleatt-gru

1:01

26/04/2020

Semi-Supervised Generative Modeling for Controllable Speech Synthesis

Raza Habib, Soroosh Mariooryad, Matt Shannon, Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, David Kao, Tom Bagby

Keywords: TTS, Speech Synthesis, Semi-supervised Models, VAE, disentanglement

5:44

12/08/2020

Void: A fast and light voice liveness detection system

Muhammad Ejaz Ahmed, Il-Youp Kwak, Jun Ho Huh, Iljoo Kim, Taekkyung Oh, Hyoungshick Kim

12:59