Audio-Visual Speech Super-Resolution

22/11/2021

Rudrabha Mukhopadhyay, Sindhu B Hegde, Vinay Namboodiri, C.V. Jawahar

Keywords: speech super-resolution, audio-visual data, audio-visual learning, pseudo-visual stream, multi-modal learning

Abstract: In this paper, we present an audio-visual model to perform speech super-resolution at large scale factors (8x and 16x). Previous works attempted to solve this problem using only the audio modality as input, and thus were limited to low scale factors of 2x and 4x. In contrast, we propose to incorporate both visual and auditory signals to super-resolve speech with sampling rates as low as 1 kHz. In such challenging situations, the visual features assist in learning the content and improve the quality of the generated speech. Further, we demonstrate the applicability of our approach to arbitrary speech signals where the visual stream is not accessible. Our "pseudo-visual network" precisely synthesizes the visual stream solely from the low-resolution speech input. Extensive experiments illustrate our method's remarkable results and benefits over state-of-the-art audio-only speech super-resolution approaches. Our project website can be found at http://cvit.iiit.ac.in/research/projects/cvit-projects/audio-visual-speech-super-resolution.
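
To make the scale factors concrete, the arithmetic below relates each factor to the input sampling rate and to the highest frequency that input can represent (its Nyquist limit). It is a minimal sketch, not the authors' code: the 16 kHz target rate is an assumption inferred from the abstract's pairing of 16x with inputs as low as 1 kHz, and the function names are hypothetical.

```python
# Assumption: the super-resolved output is sampled at 16 kHz.
TARGET_RATE_HZ = 16_000


def low_res_rate(target_rate_hz: float, scale: int) -> float:
    """Sampling rate of the low-resolution input for a given scale factor."""
    return target_rate_hz / scale


def nyquist_hz(rate_hz: float) -> float:
    """Highest frequency representable at a given sampling rate."""
    return rate_hz / 2


if __name__ == "__main__":
    for scale in (2, 4, 8, 16):
        rate = low_res_rate(TARGET_RATE_HZ, scale)
        print(f"{scale:>2}x: input at {rate:.0f} Hz, "
              f"content above {nyquist_hz(rate):.0f} Hz must be generated")
```

Under this assumption, a 16x input retains only content below 500 Hz, which suggests why an additional signal such as the visual stream could help recover what was spoken.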

The talk and the corresponding paper were published at the BMVC 2021 virtual conference.

Similar Papers

06/12/2021

PortaSpeech: Portable and High-Quality Generative Text-to-Speech

Yi Ren, Jinglin Liu, Zhou Zhao

Keywords: generative model

10:15

06/12/2020

Listening to Sounds of Silence for Speech Denoising

Henry Xu, Rundi Wu, Yuko Ishiwaka, Carl Vondrick, Changxi Zheng

3:22

06/12/2021

PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition

Cheng-I Jeff Lai, Yang Zhang, Alexander Liu, Shiyu Chang, Yi-Lun Liao, Yung-Sung Chuang, Kaizhi Qian, Sameer Khurana, David Cox, Jim Glass

Keywords: self-supervised learning, representation learning

13:57

12/08/2020

Void: A fast and light voice liveness detection system

Muhammad Ejaz Ahmed, Il-Youp Kwak, Jun Ho Huh, Iljoo Kim, Taekkyung Oh, Hyoungshick Kim

12:59

02/02/2021

Multi-SpectroGAN: High-Diversity and High-Fidelity Spectrogram Generation with Adversarial Style Combination for Speech Synthesis

Sang-Hoon Lee, Hyun-Wook Yoon, Hyeong-Rae Noh, Ji-Hoon Kim, Seong-Whan Lee

14:19

03/05/2021

End-to-end Adversarial Text-to-Speech

Jeff Donahue, Sander Dieleman, Mikolaj Binkowski, Erich Elsen, Karen Simonyan

Keywords: end-to-end, speech synthesis, feed-forward, text-to-speech, adversarial, generative model, GAN

15:23

12/07/2020

Non-Autoregressive Neural Text-to-Speech

Kainan Peng, Wei Ping, Zhao Song, Kexin Zhao

Keywords: Applications - Language, Speech and Dialog

15:12

02/02/2021

Listen, Understand and Translate: Triple Supervision Decouples End-to-end Speech-to-text Translation

Qianqian Dong, Rong Ye, Mingxuan Wang, Hao Zhou, Shuang Xu, Bo Xu, Lei Li

14:09

05/01/2021

Visual Speech Enhancement Without a Real Visual Stream

Sindhu B. Hegde, K.R. Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C.V. Jawahar

5:01

06/12/2020

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli

3:37

26/04/2020

DDSP: Differentiable Digital Signal Processing

Jesse Engel, Lamtharn (Hanoi) Hantrakul, Chenjie Gu, Adam Roberts

Keywords: dsp, audio, music, nsynth, wavenet, wavernn, vocoder, synthesizer, sound, signal, processing, tensorflow, autoencoder, disentanglement

5:11

18/07/2021

Learning de-identified representations of prosody from raw audio

Jack Weston, Raphael Lenain, Udeepa Meepegama, Emil Fristed

Keywords: Applications, Audio and Speech Processing

4:37

02/02/2021

TaLNet: Voice Reconstruction from Tongue and Lip Articulation with Transfer Learning from Text-to-Speech Synthesis

Jing-Xuan Zhang, Korin Richmond, Zhen-Hua Ling, Lirong Dai

19:58

12/07/2020

Unsupervised Speech Decomposition via Triple Information Bottleneck

Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson, David Cox

Keywords: Applications - Language, Speech and Dialog

12:34

03/05/2021

Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds

Efthymios Tzinis, Scott Wisdom, Aren Jansen, Shawn Hershey, Tal Remez, Dan Ellis, John Hershey

Keywords: self-supervised learning, universal sound separation, in-the-wild data, Audio-visual sound separation, unsupervised learning

5:06

14/06/2020

Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis

K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C.V. Jawahar

Keywords: lip to speech, lip reading, speech generation, talking face videos, lip2wav, sequence to sequence, multimodal learning, speech restoration, audio-visual understanding, face and speech

1:01

08/12/2020

Emergent Communication Pretraining for Few-Shot Machine Translation

Yaoyiran Li, Edoardo Maria Ponti, Ivan Vulić, Anna Korhonen

14:42

14/06/2020

Discriminative Multi-Modality Speech Recognition

Bo Xu, Cheng Lu, Yandong Guo, Jacob Wang

Keywords: multi-modal, audio-visual, speech recognition, lip reading, deep learning, eleatt-gru

1:01

18/07/2021

Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation

Renjie Zheng, Junkun Chen, Mingbo Ma, Liang Huang

Keywords: Applications, Natural Language Processing

5:19

30/11/2020

Do We Need Sound for Sound Source Localization?

Takashi Oya, Shohei Iwase, Ryota Natsume, Takahiro Itazuri, Shugo Yamaguchi, Shigeo Morishima

8:43

06/12/2021

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong

Keywords: machine learning, self-supervised learning, transformers, vision, contrastive learning

15:59

05/01/2021

S-VVAD: Visual Voice Activity Detection by Motion Segmentation

Muhammad Shahid, Cigdem Beyan, Vittorio Murino

4:56

19/04/2021

Disfluency correction using unsupervised and semi-supervised learning

Nikhil Saini, Drumil Trivedi, Shreya Khare, Tejas Dhamecha, Preethi Jyothi, Samarth Bharadwaj, Pushpak Bhattacharyya

7:13

18/07/2021

EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture

Chenfeng Miao, Liang Shuang, Zhengchen Liu, Chen Minchuan, Jun Ma, Shaojun Wang, Jing Xiao

Keywords: Applications, Audio and Speech Processing

5:13

06/12/2021

NORESQA: A Framework for Speech Quality Assessment using Non-Matching References

Pranay Manocha, Buye Xu, Anurag Kumar

Keywords: deep learning, robustness, self-supervised learning

14:30

03/05/2021

Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech

Yoonhyung Lee, Joongbo Shin, Kyomin Jung

Keywords: VAE, non-autoregressive, speech synthesis, text-to-speech

5:40

19/04/2021

On-device text representations robust to misspellings via projections

Chinnadhurai Sankar, Sujith Ravi, Zornitsa Kozareva

4:21

14/06/2020

A Physics-Based Noise Formation Model for Extreme Low-Light Raw Denoising

Kaixuan Wei, Ying Fu, Jiaolong Yang, Hua Huang

Keywords: extreme low-light imaging, physics-based noise modeling, extreme low-light denoising dataset

4:58

26/04/2020

High Fidelity Speech Synthesis with Adversarial Networks

Mikołaj Bińkowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C. Cobo, Karen Simonyan

Keywords: texttospeech, speechsynthesis, audiosynthesis, gans, generativeadversarialnetworks, implicitgenerativemodels

15:07

02/02/2021

Using Radio Archives for Low-Resource Speech Recognition: Towards an Intelligent Virtual Assistant for Illiterate Users

Moussa Doumbouya, Lisa Einstein, Chris Piech

16:06

14/09/2020

MMCNN: A Multi-branch Multi-scale Convolutional Neural Network for Motor Imagery Classification

Ziyu Jia, Youfang Lin, Jing Wang, Kaixin Yang, Tianhang Liu, Xinwang Zhang

Keywords: motor imagery, convolutional neural network, eeg signal, brain–computer interface

12:20

05/01/2021

Deep Interactive Thin Object Selection

Jun Hao Liew, Scott Cohen, Brian Price, Long Mai, Jiashi Feng

4:48

06/12/2021

Unsupervised Speech Recognition

Alexei Baevski, Wei-Ning Hsu, Alexis Conneau, Michael Auli

Keywords: deep learning, adversarial robustness and security, self-supervised learning, generative model

19:16

08/12/2020

Fine-grained Information Status Classification Using Discourse Context-Aware BERT

Yufang Hou

13:13

02/02/2021

Interactive Speech and Noise Modeling for Speech Enhancement

Chengyu Zheng, Xiulian Peng, Yuan Zhang, Sriram Srinivasan, Yan Lu

14:47

03/05/2021

Improving Zero-Shot Voice Style Transfer via Disentangled Representation Learning

Siyang Yuan, Pengyu Cheng, Ruiyi Zhang, Weituo Hao, Zhe Gan, Lawrence Carin

Keywords: Disentanglement, Mutual Information, Zero-shot Learning, Style Transfer

5:03

26/04/2020

Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

David Harwath, Wei-Ning Hsu, James Glass

Keywords: visually-grounded speech, self-supervised learning, discrete representation learning, vision and language, vision and speech, hierarchical representation learning

13:42

22/11/2021

Taming Visually Guided Sound Generation

Vladimir Iashin, Esa Rahtu

Keywords: multi-modal learning, audio generation, video understanding, transformer, VQVAE, MelGAN, perceptual loss, generation metrics, VGGSound, VAS

9:54

18/07/2021

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, Tom Duerig

Keywords: Deep Learning, Embedding and Representation learning

21:03

07/09/2020

NTGAN: Learning Blind Image Denoising without Clean Reference

Rui Zhao, Daniel P.K. Lun, Kin-Man Lam

Keywords: unsupervised image denoising, blind image denoising, pseudo supervision, noise transference

6:14