Speech2Video Synthesis with 3D Skeleton Regularization and Expressive Body Poses

30/11/2020

Miao Liao, Sibo Zhang, Peng Wang, Hao Zhu, Xinxin Zuo, Ruigang Yang

Abstract: In this paper, we propose a novel approach to convert a given speech audio clip into a photo-realistic speaking video of a specific person, where the output video has synchronized, realistic, and expressive body dynamics. We achieve this by first generating 3D skeleton movements from the audio sequence using a recurrent neural network (RNN), and then synthesizing the output video via a conditional generative adversarial network (GAN). To make the skeleton movement realistic and expressive, we embed the knowledge of an articulated 3D human skeleton and a learned dictionary of personal iconic speech gestures into the generation process in both the learning and testing pipelines. The former prevents the generation of unreasonable body distortion, while the latter helps our model quickly learn meaningful body movement from a few recorded videos. To produce photo-realistic, high-resolution video with motion details, we propose inserting part attention mechanisms into the conditional GAN, where detailed parts, e.g. the head and hands, are automatically zoomed in on and given their own discriminators. To validate our approach, we collect a dataset of 20 high-quality videos of one male and one female model reading various documents on different topics. Compared with previous state-of-the-art pipelines handling similar tasks, our approach achieves better results in a user study.
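One way to read the claim that an articulated 3D skeleton "prevents the generation of unreasonable body distortion" is that the network predicts joint rotations while bone lengths stay fixed, so a bone can never stretch or shrink. The sketch below illustrates that idea only. It is not the authors' implementation: the three-joint arm chain, the bone lengths, the per-bone (yaw, pitch) parameterization, and all function names are assumptions made for illustration, and the angles are treated as absolute bone directions rather than rotations relative to the parent bone.

```python
import math

# Hypothetical arm chain: shoulder -> elbow -> wrist.
# Bone lengths in metres are illustrative values, not taken from the paper.
BONE_LENGTHS = [0.30, 0.25]


def direction(yaw, pitch):
    """Return the unit vector defined by two angles (radians)."""
    return (
        math.cos(pitch) * math.cos(yaw),
        math.cos(pitch) * math.sin(yaw),
        math.sin(pitch),
    )


def forward_kinematics(root, angles, bone_lengths=BONE_LENGTHS):
    """Place the joints of a chain from per-bone (yaw, pitch) angles.

    Each joint is its parent plus a unit direction scaled by a fixed
    bone length, so no choice of angles can change a bone's length.
    """
    joints = [tuple(root)]
    for (yaw, pitch), length in zip(angles, bone_lengths):
        unit = direction(yaw, pitch)
        parent = joints[-1]
        joints.append(tuple(p + length * u for p, u in zip(parent, unit)))
    return joints


def bone_lengths_of(joints):
    """Measure the distance between each pair of consecutive joints."""
    return [math.dist(a, b) for a, b in zip(joints, joints[1:])]


# Two arbitrary poses give different joint positions but identical bone lengths.
pose_a = forward_kinematics((0.0, 1.4, 0.0), [(0.3, -1.2), (1.1, 0.4)])
pose_b = forward_kinematics((0.0, 1.4, 0.0), [(-2.0, 0.7), (0.1, -0.9)])
print(bone_lengths_of(pose_a))
print(bone_lengths_of(pose_b))
```

Under this assumed parameterization, an audio-to-pose model would output angles rather than raw joint coordinates, and the fixed lengths act as a hard constraint instead of a penalty term in the loss.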

The video of this talk cannot be embedded. You can watch it here:

https://accv2020.github.io/miniconf/poster_31.html

The talk and the corresponding paper were published at the ACCV 2020 virtual conference.
