TaLNet: Voice Reconstruction from Tongue and Lip Articulation with Transfer Learning from Text-to-Speech Synthesis

02/02/2021

TaLNet: Voice Reconstruction from Tongue and Lip Articulation with Transfer Learning from Text-to-Speech Synthesis

Jing-Xuan Zhang, Korin Richmond, Zhen-Hua Ling, Lirong Dai

Keywords:

Abstract Paper Similar Papers

Abstract: This paper presents TaLNet, a model for voice reconstruction with ultrasound tongue and optical lip videos as inputs. TaLNet is based on an encoder-decoder architecture. Separate encoders are dedicated to processing the tongue and lip data streams respectively. The decoder predicts acoustic features conditioned on encoder outputs and speaker codes.To mitigate for having only relatively small amounts of dual articulatory-acoustic data available for training, and since our task here shares with text-to-speech (TTS) the common goal of speech generation, we propose a novel transfer learning strategy to exploit the much larger amounts of acoustic-only data available to train TTS models. For this, a Tacotron 2 TTS model is first trained, and then the parameters of its decoder are transferred to the TaLNet decoder. We have evaluated our approach on an unconstrained multi-speaker voice recovery task. Our results show the effectiveness of both the proposed model and the transfer learning strategy. Speech reconstructed using our proposed method significantly outperformed all baselines (DNN, BLSTM and without transfer learning) in terms of both naturalness and intelligibility. When using an ASR model decoding the recovery speech, the WER of our proposed method is relatively reduced over 30% compared to baselines.

The video of this talk cannot be embedded. You can watch it here:

https://slideslive.com/38947997

(Link will open in new window)

0

0

0

0

Share

This is an embedded video. Talk and the respective paper are published at AAAI 2021 virtual conference. If you are one of the authors of the paper and want to manage your upload, see the question "My papertalk has been externally embedded..." in the FAQ section.

Comments

Post Comment

no comments yet

Similar Papers

02/02/2021

Exploring Transfer Learning For End-to-End Spoken Language Understanding

Subendhu Rongali, Beiye Liu, Liwei Cai and
Konstantine Arkoudas, Chengwei Su, Wael Hamza

Keywords Paper

0

0

0

0

19:30

18/07/2021

EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture

Chenfeng Miao, Liang Shuang, Zhengchen Liu and
Chen Minchuan, Jun Ma, Shaojun Wang, Jing Xiao

Keywords Paper

Applications, Audio and Speech Processing

0

0

0

0

5:13

26/04/2020

Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

David Harwath, Wei-Ning Hsu, James Glass

Keywords Paper

visually-grounded speech, self-supervised learning, discrete representation learning, vision and language, vision and speech, hierarchical representation learning

0

0

0

0

13:42

03/05/2021

Improving Zero-Shot Voice Style Transfer via Disentangled Representation Learning

Siyang Yuan, Pengyu Cheng, Ruiyi Zhang and
Weituo Hao, Zhe Gan, Lawrence Carin

Keywords Paper

Disentanglement, Mutual Information, Zero-shot Learning, Style Transfer

0

0

0

0

5:03

08/12/2020

Transliteration of Judeo-Arabic Texts into Arabic Script Using Recurrent Neural Networks

Ori Terner, Kfir Bar, Nachum Dershowitz

Keywords Paper

0

0

0

0

9:50

01/07/2020

End-to-End Speech Translation with Adversarial Training

Xuancai Li, Chen Kehai, Tiejun Zhao, Muyun Yang

Keywords Paper

0

0

0

0

8:53

04/07/2020

Jointly Masked Sequence-to-Sequence Model for Non-Autoregressive Neural Machine Translation

Junliang Guo, Linli Xu, Enhong Chen

Keywords Paper

Non-Autoregressive Translation, natural tasks, non-autoregressive translation~(NAT, non-autoregressive

0

0

0

0

10:47

02/02/2021

Multi-SpectroGAN: High-Diversity and High-Fidelity Spectrogram Generation with Adversarial Style Combination for Speech Synthesis

Sang-Hoon Lee, Hyun-Wook Yoon, Hyeong-Rae Noh and
Ji-Hoon Kim, Seong-Whan Lee

Keywords Paper

0

0

0

0

14:19

06/12/2021

PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition

Cheng-I Jeff Lai, Yang Zhang, Alexander Liu and
Shiyu Chang, Yi-Lun Liao, Yung-Sung Chuang, Kaizhi Qian, Sameer Khurana, David Cox, Jim Glass

Keywords Paper

self-supervised learning, representation learning

0

0

0

0

13:57

06/12/2020

Unsupervised Sound Separation Using Mixture Invariant Training

Scott Wisdom, Efthymios Tzinis, Hakan Erdogan and
Ron Weiss, Kevin Wilson, John R. Hershey

Keywords Paper

0

0

0

0

3:20

02/02/2021

Listen, Understand and Translate: Triple Supervision Decouples End-to-end Speech-to-text Translation

Qianqian Dong, Rong Ye, Mingxuan Wang and
Hao Zhou, Shuang Xu, Bo Xu, Lei Li

Keywords Paper

0

0

0

0

14:09

11/10/2020

Zero-shot Singing Voice Conversion

Shahan Nercessian

Keywords Paper

MIR tasks, Music synthesis and transformation, Domain knowledge, Machine learning/Artificial intelligence for music, Musical features and properties, Timbre, instrumentation, and voice

0

0

0

0

2:51

26/04/2020

From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face from Speech

Hyeong-Seok Choi, Changdae Park, Kyogu Lee

Keywords Paper

Multi-modal learning, Self-supervised learning, Voice profiling, Conditional GANs

0

0

0

0

5:15

12/07/2020

Non-Autoregressive Neural Text-to-Speech

Kainan Peng, Wei Ping, Zhao Song, Kexin Zhao

Keywords Paper

Applications - Language, Speech and Dialog

0

0

0

0

15:12

03/05/2021

Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis

Rafael Valle, Kevin J Shih, Ryan Prenger, Bryan Catanzaro

Keywords Paper

normalizing flows, deep learning, Text to speech synthesis

0

0

0

0

5:11

06/12/2020

Uncertainty-aware Self-training for Few-shot Text Classification

Subhabrata Mukherjee, Ahmed Awadallah

Keywords Paper

0

0

0

0

3:16

18/07/2021

Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

Dongchan Min, Dong Bok Lee, Eunho Yang, Sung Ju Hwang

Keywords Paper

Applications, Audio and Speech Processing

0

0

0

0

5:17

19/04/2021

Neural data-to-text generation with LM-based text augmentation

Ernie Chang, Xiaoyu Shen, Dawei Zhu and
Vera Demberg, Hui Su

Keywords Paper

0

0

0

0

7:32

04/07/2020

Cross-Lingual Unsupervised Sentiment Classification with Multi-View Transfer Learning

Hongliang Fei, Ping Li

Keywords Paper

Cross-Lingual Classification, sentiment classification, unsupervised system, classification

0

0

0

0

12:23

18/07/2021

Learning de-identified representations of prosody from raw audio

Jack Weston, Raphael Lenain, Udeepa Meepegama, Emil Fristed

Keywords Paper

Applications, Audio and Speech Processing

0

0

0

0

4:37

06/12/2021

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

Hassan Akbari, Liangzhe Yuan, Rui Qian and
Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong

Keywords Paper

machine learning, self-supervised learning, transformers, vision, contrastive learning

0

0

0

0

15:59

02/11/2020

Model selection for deep audio source separation via clustering analysis

Alisa Liu, Prem Seetharaman, Bryan Pardo

Keywords Paper

0

0

0

0

12:12

03/05/2021

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

Yi Ren, Chenxu Hu, Xu Tan and
Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

Keywords Paper

end-to-end, non-autoregressive generation, speech synthesis, one-to-many mapping, text to speech

0

0

0

0

7:01

06/12/2020

Listening to Sounds of Silence for Speech Denoising

Henry Xu, Rundi Wu, Yuko Ishiwaka and
Carl Vondrick, Changxi Zheng

Keywords Paper

0

0

0

0

3:22

04/07/2020

Towards end-2-end learning for predicting behavior codes from spoken utterances in psychotherapy conversations

Karan Singla, Zhuohao Chen, David Atkins, Shrikanth Narayanan

Keywords Paper

predicting codes, Spoken tasks, voice detection, speaker diarization

0

0

0

0

7:16

19/08/2021

A Streaming End-to-End Framework For Spoken Language Understanding

Nihal Potdar, Anderson Raymundo Avila, Chao Xing and
Dong Wang, Yiran Cao, Xiao Chen

Keywords Paper

Natural Language Processing, Dialogue, Speech

0

0

0

0

14:09

06/12/2021

Understanding Adaptive, Multiscale Temporal Integration In Deep Speech Recognition Systems

Menoua Keshishian, Samuel Norman-Haignere, Nima Mesgarani

Keywords Paper

deep learning, machine learning

0

0

0

0

10:28

08/12/2020

Attentively Embracing Noise for Robust Latent Representation in BERT

Gwenaelle Cunha Sergio, Dennis Singh Moirangthem, Minho Lee

Keywords Paper

0

0

0

0

12:55

18/07/2021

Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation

Renjie Zheng, Junkun Chen, Mingbo Ma, Liang Huang

Keywords Paper

Applications, Natural Language Processing

0

0

0

0

5:19

16/11/2020

Named Entity Recognition Only from Word Embeddings

Ying Luo, Hai Zhao, Junlang Zhan

Keywords Paper

named recognition, entity detection, type prediction, deep models

0

0

0

0

9:54

16/11/2020

Learning Music Helps You Read: Using Transfer to Study Linguistic Structure in Language Models

Isabel Papadimitriou, Dan Jurafsky

Keywords Paper

analyzing structure, encoding structure, natural acquisition, transfer learning

0

0

0

0

11:44

03/05/2021

Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech

Yoonhyung Lee, Joongbo Shin, Kyomin Jung

Keywords Paper

VAE, non-autoregressive, speech synthesis, text-to-speech

0

0

0

0

5:40

25/07/2020

Auto-annotation for voice-enabled entertainment systems

Wenyan Li, Ferhan Ture

Keywords Paper

unsupervised, voice-enabled entertainment systems, automatic speech recognition, error detection and evaluation, auto-annotation

0

0

0

0

8:06

05/12/2020

Systematic generalization on gSCAN with language conditioned embedding

Tong Gao, Qi Huang, Raymond Mooney

Keywords Paper

0

0

0

0

14:19

26/04/2020

Distance-Based Learning from Errors for Confidence Calibration

Chen Xing, Sercan Arik, Zizhao Zhang, Tomas Pfister

Keywords Paper

Confidence Calibration, Uncertainty Estimation, Prototypical Learning

0

0

0

0

5:09

06/12/2021

Unsupervised Speech Recognition

Alexei Baevski, Wei-Ning Hsu, Alexis CONNEAU, Michael Auli

Keywords Paper

deep learning, adversarial robustness and security, self-supervised learning, generative model

0

0

0

0

19:16

05/01/2021

Visual Speech Enhancement Without a Real Visual Stream

Sindhu B. Hegde, K.R. Prajwal, Rudrabha Mukhopadhyay and
Vinay P. Namboodiri, C.V. Jawahar

Keywords Paper

0

0

0

0

5:01

03/05/2021

End-to-end Adversarial Text-to-Speech

Jeff Donahue, Sander Dieleman, Mikolaj Binkowski and
Erich Elsen, Karen Simonyan

Keywords Paper

end-to-end, speech synthesis, feed-forward, text-to-speech, adversarial, generative model, GAN

0

0

0

0

15:23

06/12/2020

A Spectral Energy Distance for Parallel Speech Synthesis

Alexey Gritsenko, Tim Salimans, Rianne van den Berg and
Jasper Snoek, Nal Kalchbrenner

Keywords Paper

0

0

0

0

3:11

03/05/2021

Self-supervised Representation Learning with Relative Predictive Coding

Yao-Hung Hubert Tsai, Martin Q Ma, Muqiao Yang and
Han Zhao, LP Morency, Ruslan Salakhutdinov

Keywords Paper

dependency based method, self-supervised learning, contrastive learning

0

0

0

0

6:02