Speech2Action: Cross-Modal Supervision for Action Recognition

14/06/2020

Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman

Keywords: action recognition, cross-modal, weak supervision, deep learning, movies, video understanding, speech, classification, multimodal, bert

Abstract: Is it possible to guess human action from dialogue alone? In this work we investigate the link between spoken words and actions in movies. We note that movie screenplays describe actions, as well as contain the speech of characters and hence can be used to learn this correlation with no additional supervision. We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments. We then apply this model to the speech segments of a large unlabelled movie corpus (188M speech segments from 288K movies). Using the predictions of this model, we obtain weak action labels for over 800K video clips. By training on these video clips, we demonstrate superior action recognition performance on standard action recognition benchmarks, without using a single manually labelled action example.
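The weak-labelling step described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: `toy_speech2action` is a hypothetical keyword scorer standing in for the trained BERT-based classifier, and the `SpeechSegment` type, the cue words, and the confidence threshold are invented for the example and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class SpeechSegment:
    clip_id: str  # hypothetical identifier of the video clip aligned with the speech
    text: str     # transcribed speech for that clip


def toy_speech2action(text: str) -> Dict[str, float]:
    """Hypothetical stand-in for a trained speech-to-action classifier.

    Scores each action by the fraction of words that match a small cue list.
    The actions and cue words here are assumptions made for illustration.
    """
    cues = {
        "phone": ("hello", "call"),
        "drive": ("car", "drive"),
        "dance": ("dance", "music"),
    }
    words = [w.strip(".,!?") for w in text.lower().split()]
    return {
        action: sum(1 for w in words if w in keywords) / max(len(words), 1)
        for action, keywords in cues.items()
    }


def weak_labels(
    segments: List[SpeechSegment],
    classifier: Callable[[str], Dict[str, float]],
    threshold: float,
) -> List[Tuple[str, str]]:
    """Keep only clips whose top-scoring action reaches the threshold."""
    labelled = []
    for seg in segments:
        scores = classifier(seg.text)
        action, score = max(scores.items(), key=lambda kv: kv[1])
        if score >= threshold:
            labelled.append((seg.clip_id, action))
    return labelled


segments = [
    SpeechSegment("clip_a", "Hello, who is this?"),
    SpeechSegment("clip_b", "Nice weather today."),
]
labels = weak_labels(segments, toy_speech2action, threshold=0.2)
```

The resulting (clip, action) pairs would then serve as training targets for a video model. Dropping low-confidence segments trades coverage for label quality, which is one plausible way to read "weak action labels" here.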

The talk and the respective paper were published at the CVPR 2020 virtual conference.

Similar Papers

04/07/2020

Speaker Sensitive Response Evaluation Model

JinYeong Bak, Alice Oh

Keywords: Speaker Model, Automatic generation, open-domain generation, automatic models

10:40

22/11/2021

Talking Head Generation with Audio and Speech Related Facial Action Units

Sen Chen, Zhilei Liu, Jiaxing Liu, Zhengxiang Yan, Longbiao Wang

Keywords: Talking Face Generation, Facial Action Unit, Generative Adversarial Network, Video Synthesis, Face Manipulation

2:41

06/12/2021

Neural Dubber: Dubbing for Videos According to Scripts

Chenxu Hu, Qiao Tian, Tingle Li, Wang Yuping, Yuxuan Wang, Hang Zhao

Keywords: deep learning

7:04

01/07/2020

How to Tame Your Data: Data Augmentation for Dialog State Tracking

Adam Summerville, Jordan Hashemi, James Ryan, William Ferguson

15:26

25/07/2020

Auto-annotation for voice-enabled entertainment systems

Wenyan Li, Ferhan Ture

Keywords: unsupervised, voice-enabled entertainment systems, automatic speech recognition, error detection and evaluation, auto-annotation

8:06

02/02/2021

Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation

Lincheng Li, Suzhen Wang, Zhimeng Zhang, Yu Ding, Yixing Zheng, Xin Yu, Changjie Fan

15:58

19/04/2021

DOCENT: Learning self-supervised entity representations from large document collections

Yury Zemlyanskiy, Sudeep Gandhe, Ruining He, Bhargav Kanagal, Anirudh Ravula, Juraj Gottweis, Fei Sha, Ilya Eckstein

6:37

16/11/2020

MovieChats: Chat like Humans in a Closed Domain

Hui Su, Xiaoyu Shen, Zhou Xiao, Zheng Zhang, Ernie Chang, Cheng Zhang, Cheng Niu, Jie Zhou

Keywords: in-depth chat, intent prediction, knowledge retrieval, neural approach

10:05

22/11/2021

Visual Keyword Spotting with Attention

Prajwal K R, Liliane Momeni, Triantafyllos Afouras, Andrew Zisserman

Keywords: visual keyword spotting, lip reading

2:53

06/12/2021

MERLOT: Multimodal Neural Script Knowledge Models

Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi

Keywords: representation learning

18:15

03/05/2021

Unsupervised Audiovisual Synthesis via Exemplar Autoencoders

Kangle Deng, Aayush Bansal, Deva Ramanan

Keywords: voice conversion, assistive technology, audiovisual synthesis, autoencoders, speech-impaired, unsupervised learning

5:09

02/02/2021

Listen, Understand and Translate: Triple Supervision Decouples End-to-end Speech-to-text Translation

Qianqian Dong, Rong Ye, Mingxuan Wang, Hao Zhou, Shuang Xu, Bo Xu, Lei Li

14:09

04/07/2020

Towards end-2-end learning for predicting behavior codes from spoken utterances in psychotherapy conversations

Karan Singla, Zhuohao Chen, David Atkins, Shrikanth Narayanan

Keywords: predicting codes, Spoken tasks, voice detection, speaker diarization

7:16

06/12/2020

Listening to Sounds of Silence for Speech Denoising

Henry Xu, Rundi Wu, Yuko Ishiwaka, Carl Vondrick, Changxi Zheng

3:22

05/01/2021

S-VVAD: Visual Voice Activity Detection by Motion Segmentation

Muhammad Shahid, Cigdem Beyan, Vittorio Murino

4:56

08/12/2020

Attentively Embracing Noise for Robust Latent Representation in BERT

Gwenaelle Cunha Sergio, Dennis Singh Moirangthem, Minho Lee

12:55

02/02/2021

Exploring Transfer Learning For End-to-End Spoken Language Understanding

Subendhu Rongali, Beiye Liu, Liwei Cai, Konstantine Arkoudas, Chengwei Su, Wael Hamza

19:30

14/06/2020

Transferring Cross-Domain Knowledge for Video Sign Language Recognition

Dongxu Li, Xin Yu, Chenchen Xu, Lars Petersson, Hongdong Li

Keywords: sign language recognition, video classification, transfer learning, action recognition, semisupervised learning, domain adaptation, vision and language, human pose, few-shot learning

4:56

02/02/2021

Interpretable Self-Supervised Facial Micro-Expression Learning to Predict Cognitive State and Neurological Disorders

Arun Das, Jeffrey Mock, Yufei Huang, Edward Golob, Peyman Najafirad

17:56

02/02/2021

DDRel: A New Dataset for Interpersonal Relation Classification in Dyadic Dialogues

Qi Jia, Hongru Huang, Kenny Q. Zhu

14:54

04/07/2020

Grounding Conversations with Improvised Dialogues

Hyundong Cho, Jonathan May

Keywords: Grounding Conversations, dialogue systems, bootstrapped classifier, chit-chat systems

11:34

02/02/2021

Encoding Syntactic Knowledge in Transformer Encoder for Intent Detection and Slot Filling

Jixuan Wang, Kai Wei, Martin Radfar, Weiwei Zhang, Clement Chung

19:31

06/12/2021

CLIP-It! Language-Guided Video Summarization

Medhini Narasimhan, Anna Rohrbach, Trevor Darrell

Keywords: transformers

6:14

19/04/2021

WER-BERT: Automatic WER estimation with BERT in a balanced ordinal classification paradigm

Akshay Krishna Sheshadri, Anvesh Rao Vijjini, Sukhdeep Kharbanda

11:45

16/11/2020

Digital Voicing of Silent Speech

David Gaddy, Dan Klein

Keywords: digitally speech, speech models, emg, silently words

10:56

16/11/2020

Predicting In-game Actions from Interviews of NBA Players

Nadav Oved, Amir Feder, Roi Reichart

Keywords: computer science, player prediction, text tasks, players prediction

11:59

22/11/2021

AudViSum: Self-Supervised Deep Reinforcement Learning for Diverse Audio-Visual Summary Generation

Sanjoy Chowdhury, Aditya Patra, Subhrajyoti Dasgupta, Ujjwal Bhattacharya

Keywords: video summarization, audio-visual summarization, multi-modal learning, self-supervised learning, contrastive loss

3:05

18/07/2021

EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture

Chenfeng Miao, Liang Shuang, Zhengchen Liu, Chen Minchuan, Jun Ma, Shaojun Wang, Jing Xiao

Keywords: Applications, Audio and Speech Processing

5:13

30/11/2020

Watch, read and lookup: learning to spot signs from multiple supervisors

Liliane Momeni, Gul Varol, Samuel Albanie, Triantafyllos Afouras, Andrew Zisserman

9:58

19/08/2021

Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion

Suzhen Wang, Lincheng Li, Yu Ding, Changjie Fan, Xin Yu

Keywords: Computer Vision, Language and Vision, Motion and Tracking, Structural and Model-Based Approaches, Knowledge Representation and Reasoning

8:31

17/08/2020

Unpaired motion style transfer from video to animation

Kfir Aberman, Yijia Weng, Dani Lischinski, Daniel Cohen-Or, Baoquan Chen

Keywords: style transfer, motion analysis

16:08

02/02/2021

Revisiting Mahalanobis Distance for Transformer-Based Out-of-Domain Detection

Alexander Podolskiy, Dmitry Lipin, Andrey Bout, Ekaterina Artemova, Irina Piontkovskaya

16:08

02/02/2021

Converse, Focus and Guess - Towards Multi-Document Driven Dialogue

Han Liu, Caixia Yuan, Xiaojie Wang, Yushu Yang, Huixing Jiang, Zhongyuan Wang

17:28

19/04/2021

Streaming models for joint speech recognition and translation

Orion Weller, Matthias Sperber, Christian Gollan, Joris Kluivers

5:11

07/09/2020

Seeing wake words: Audio-visual Keyword Spotting

Liliane Momeni, Triantafyllos Afouras, Themos Stafylakis, Samuel Albanie, Andrew Zisserman

Keywords: keyword spotting, wake word recognition, zero-shot, audio-visual, lip reading, speech recognition, retrieval

9:30

02/02/2021

Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue

Longxiang Liu, Zhuosheng Zhang, Hai Zhao, Xi Zhou, Xiang Zhou

18:11

05/01/2021

Visual Speech Enhancement Without a Real Visual Stream

Sindhu B. Hegde, K.R. Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C.V. Jawahar

5:01

02/02/2021

TaLNet: Voice Reconstruction from Tongue and Lip Articulation with Transfer Learning from Text-to-Speech Synthesis

Jing-Xuan Zhang, Korin Richmond, Zhen-Hua Ling, Lirong Dai

19:58

18/07/2021

Global Prosody Style Transfer Without Text Transcriptions

Kaizhi Qian, Yang Zhang, Shiyu Chang, Jinjun Xiong, Chuang Gan, David Cox, Mark Hasegawa-Johnson

Keywords: Applications, Audio and Speech Processing

20:43

16/11/2020

Multilingual Denoising Pre-training for Neural Machine Translation

Jiatao Gu, Yinhan Liu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer

Keywords: machine tasks, pre-training, multilingual pre-training, mbart

10:32