A hierarchical approach to vision-based language generation: from simple sentences to complex natural language

08/12/2020

Simion-Vlad Bogolin, Ioana Croitoru, Marius Leordeanu

Abstract: Automatically describing videos in natural language is an ambitious problem, one that could bridge our understanding of vision and language. We propose a hierarchical approach that first generates video descriptions as sequences of simple sentences and then, at the next level, produces a more complex and fluent description in natural language. While the simple sentences describe simple actions in the form of (subject, verb, object), the second-level paragraph descriptions, which indirectly use information from the first-level description, present the visual content in a more compact, coherent and semantically rich manner. To this end, we introduce the first video dataset in the literature that is annotated with captions at two levels of linguistic complexity. We perform extensive tests demonstrating that our hierarchical linguistic representation, from simple to complex language, allows us to train a two-stage network that is able to generate significantly more complex paragraphs than current one-stage approaches.

A video of this talk is available here:

https://underline.io/lecture/6221-a-hierarchical-approach-to-vision-based-language-generation-from-simple-sentences-to-complex-natural-language

The talk and the corresponding paper were published at the COLING 2020 virtual conference.

Similar Papers

14/06/2020

Hierarchical Conditional Relation Networks for Video Question Answering

Thao Minh Le, Vuong Le, Svetha Venkatesh, Truyen Tran

Keywords: video question answering, visual question answering, conditional relation network, vision-language neural network

5:00

26/04/2020

Neural Machine Translation with Universal Visual Representation

Zhuosheng Zhang, Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita, Zuchao Li, Hai Zhao

Keywords: Neural Machine Translation, Visual Representation, Multimodal Machine Translation, Language Representation

4:50

06/12/2021

MERLOT: Multimodal Neural Script Knowledge Models

Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi

Keywords: representation learning

18:15

04/07/2020

Video-Grounded Dialogues with Pretrained Generation Language Models

Hung Le, Steven C.H. Hoi

Keywords: downstream tasks, video-grounded tasks, sequence-to-sequence task, Pretrained Models

7:22

07/09/2020

Sentence Guided Temporal Modulation for Dynamic Video Thumbnail Generation

Mrigank Rochan, Mahesh Kumar Krishna Reddy, Yang Wang

Keywords: video thumbnail generation, conditional normalization

7:40

06/12/2020

Self-Supervised MultiModal Versatile Networks

Jean-Baptiste Alayrac, Adria Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, Andrew Zisserman

3:25

06/12/2021

CLIP-It! Language-Guided Video Summarization

Medhini Narasimhan, Anna Rohrbach, Trevor Darrell

Keywords: transformers

6:14

02/02/2021

Augmented Partial Mutual Learning with Frame Masking for Video Captioning

Ke Lin, Zhuoxin Gan, Liwei Wang

16:57

02/02/2021

SeCo: Exploring Sequence Supervision for Unsupervised Representation Learning

Ting Yao, Yiheng Zhang, Zhaofan Qiu, Yingwei Pan, Tao Mei

16:17

26/04/2020

AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures

Michael S. Ryoo, AJ Piergiovanni, Mingxing Tan, Anelia Angelova

Keywords: video representation learning, video understanding, activity recognition, neural architecture search

5:02

06/12/2021

End-to-end Multi-modal Video Temporal Grounding

Yi-Wen Chen, Yi-Hsuan Tsai, Ming-Hsuan Yang

Keywords: self-supervised learning, transformers, vision, contrastive learning

8:46

02/02/2021

CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation

Yang Fu, Linjie Yang, Ding Liu, Thomas S. Huang, Humphrey Shi

16:24

02/02/2021

Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation

Lincheng Li, Suzhen Wang, Zhimeng Zhang, Yu Ding, Yixing Zheng, Xin Yu, Changjie Fan

15:58

05/01/2021

High-Quality Frame Interpolation via Tridirectional Inference

Jinsoo Choi, Jaesik Park, In So Kweon

4:08

07/09/2020

Multimodal Image Translation with Stochastic Style Representations and Mutual Information Loss

Sanghyeon Na, Seungjoo Yoo, Jaegul Choo

Keywords: image-to-image translation, generative adversarial network

9:52

22/11/2021

Hierarchical Contrastive Motion Learning for Video Action Recognition

Xitong Yang, Xiaodong Yang, Sifei Liu, Deqing Sun, Larry Davis, Jan Kautz

Keywords: action recognition, motion hierarchy, motion representation, contrastive learning

8:29

16/11/2020

Beyond Instructional Videos: Probing for More Diverse Visual-Textual Grounding on YouTube

Jack Hessel, Zhenhai Zhu, Bo Pang, Radu Soricut

Keywords: pretraining, video tasks, pretraining model, features

7:00

22/11/2021

TEAM-Net: Multi-modal Learning for Video Action Recognition with Partial Decoding

Zhengwei Wang, Qi She, Aljosa Smolic

Keywords: video action recognition, partially decoded video, multi-modal fusion

3:24

19/04/2021

MONAH: Multi-modal narratives for humans to analyze conversations

Joshua Y. Kim, Kalina Yacef, Greyson Kim, Chunfeng Liu, Rafael Calvo, Silas Taylor

11:51

02/02/2021

Proposal-Free Video Grounding with Contextual Pyramid Network

Kun Li, Dan Guo, Meng Wang

14:19

04/07/2020

Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA

Hyounghun Kim, Zineng Tang, Mohit Bansal

Keywords: Dense-Caption Matching, Temporal VideoQA, answering questions, frame problem

10:56

02/02/2021

Object Relation Attention for Image Paragraph Captioning

Li-Chuan Yang, Chih-Yuan Yang, Jane Yung-jen Hsu

15:03

22/11/2021

Learning to Deblur and Rotate Motion-Blurred Faces

Givi Meishvili, Attila Szabo, Simon Jenni, Paolo Favaro

Keywords: deblurring, face, multi-view, video, blur, GAN, novel view synthesis, inversion, deep learning, dataset

2:50

07/09/2020

Tripping through time: Efficient Localization of Activities in Videos

Meera Hahn, Asim Kadav, James Rehg, Hans Peter Graf

Keywords: Activity Localization, Reinforcement learning, Vision and Language

10:10

14/06/2020

Interpreting the Latent Space of GANs for Semantic Face Editing

Yujun Shen, Jinjin Gu, Xiaoou Tang, Bolei Zhou

Keywords: generative adversarial network, network interpretation, face editing

1:01

04/07/2020

Improving Image Captioning with Better Use of Caption

Zhan Shi, Xu Zhou, Xipeng Qiu, Xiaodan Zhu

Keywords: Image Captioning, multimodal problem, natural processing, computer community

11:11

30/11/2020

Multi-Task Learning for Simultaneous Video Generation and Remote Photoplethysmography Estimation

Yun-Yun Tsou, Yi-An Lee, Chiou-Ting Hsu

9:53

05/01/2021

DORi: Discovering Object Relationships for Moment Localization of a Natural Language Query in a Video

Cristian Rodriguez-Opazo, Edison Marrese-Taylor, Basura Fernando, Hongdong Li, Stephen Gould

5:02

05/01/2021

Faces a la Carte: Text-to-Face Generation via Attribute Disentanglement

Tianren Wang, Teng Zhang, Brian Lovell

4:33

06/12/2020

Network-to-Network Translation with Conditional Invertible Neural Networks

Robin Rombach, Patrick Esser, Bjorn Ommer

3:25

03/05/2021

Self-Supervised Learning of Compressed Video Representations

Youngjae Yu, Sangho Lee, Gunhee Kim, Yale Song

Keywords: self-supervised learning, Compressed videos

4:34

19/04/2021

DOCENT: Learning self-supervised entity representations from large document collections

Yury Zemlyanskiy, Sudeep Gandhe, Ruining He, Bhargav Kanagal, Anirudh Ravula, Juraj Gottweis, Fei Sha, Ilya Eckstein

6:37

02/02/2021

Translate the Facial Regions You Like Using Self-Adaptive Region Translation

Wenshuang Liu, Wenting Chen, Zhanjia Yang, Linlin Shen

14:19

14/06/2020

Single-View View Synthesis With Multiplane Images

Richard Tucker, Noah Snavely

Keywords: view synthesis, monocular, multiplane image, image-based rendering, 3d deep learning, scale invariance

1:01

14/06/2020

Self-Supervised Monocular Trained Depth Estimation Using Self-Attention and Discrete Disparity Volume

Adrian Johnston, Gustavo Carneiro

Keywords: self-supervised depth estimation, self-supervised learning, self-attention, depth estimation, uncertainty

1:01

16/11/2020

Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding

Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, Jason Baldridge

Keywords: multitask learning, embodied agents, vln, rxr

13:05

02/02/2021

Activity Image-to-Video Retrieval by Disentangling Appearance and Motion

Liu Liu, Jiangtong Li, Li Niu, Ruicong Xu, Liqing Zhang

14:34

14/06/2020

ActBERT: Learning Global-Local Video-Text Representations

Linchao Zhu, Yi Yang

Keywords: actbert, cross-modal pretraining, video and language, transformer, tangled transformer, instructional videos

4:58

16/11/2020

Multi-Stage Pre-training for Low-Resource Domain Adaptation

Rong Zhang, Revanth Gangi Reddy, Md Arafat Sultan, Vittorio Castelli, Anthony Ferritto, Radu Florian, Efsun Sarioglu Kayi, Salim Roukos, Avi Sil, Todd Ward

Keywords: nlp tasks, fine-tuning, auxiliary tasks, lm transfer

6:56

14/06/2020

MaskGAN: Towards Diverse and Interactive Facial Image Manipulation

Cheng-Han Lee, Ziwei Liu, Lingyun Wu, Ping Luo

Keywords: facial image manipulation, face segmentation, image synthesis, generative adversarial network

1:00