Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze

16/11/2020

Ece Takmaz, Sandro Pezzelle, Lisa Beinborn, Raquel Fernández

Keywords: image process, language production, image generation, visual processing

Abstract: When speakers describe an image, they tend to look at objects before mentioning them. In this paper, we investigate such sequential cross-modal alignment by modelling the image description generation process computationally. We take as our starting point a state-of-the-art image captioning system and develop several model variants that exploit information from human gaze patterns recorded during language production. In particular, we propose the first approach to image description generation where visual processing is modelled sequentially. Our experiments and analyses confirm that better descriptions can be obtained by exploiting gaze-driven attention and shed light on human cognitive processes by comparing different ways of aligning the gaze modality with language production. We find that processing gaze data sequentially leads to descriptions that are better aligned to those produced by speakers, more diverse, and more natural, particularly when gaze is encoded with a dedicated recurrent component.
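
The abstract contrasts using gaze information as a whole with processing it sequentially through a dedicated recurrent component. The following is a minimal, hypothetical sketch of that contrast in plain Python, not the authors' implementation. It assumes precomputed per-region image features, fixations given as (region index, duration in ms) pairs already split into time windows aligned with the description, and an untrained Elman RNN as a stand-in for the recurrent gaze encoder. All names (`gaze_weighted_feature`, `ElmanRNN`, `windows`) are illustrative.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]
# Hypothetical fixation format: (index of the fixated image region, fixation duration in ms).
Fixation = Tuple[int, float]


def gaze_weighted_feature(regions: Sequence[Vector], fixations: Sequence[Fixation]) -> Vector:
    """Weighted average of region features, each region weighted by its total fixation time."""
    weights = [0.0] * len(regions)
    for region, duration in fixations:
        weights[region] += duration
    total = sum(weights)
    if total == 0:
        # No fixations in this window: fall back to a uniform average over regions.
        weights = [1.0] * len(regions)
        total = float(len(regions))
    dim = len(regions[0])
    return [sum(w * r[d] for w, r in zip(weights, regions)) / total for d in range(dim)]


class ElmanRNN:
    """Untrained Elman RNN, h_t = tanh(W_x x_t + W_h h_{t-1} + b); a stand-in gaze encoder."""

    def __init__(self, input_dim: int, hidden_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(hidden_dim)
        self.hidden_dim = hidden_dim
        self.w_x = [[rng.uniform(-scale, scale) for _ in range(input_dim)] for _ in range(hidden_dim)]
        self.w_h = [[rng.uniform(-scale, scale) for _ in range(hidden_dim)] for _ in range(hidden_dim)]
        self.b = [0.0] * hidden_dim

    def step(self, x: Vector, h: Vector) -> Vector:
        # One recurrent update combining the current gaze input with the previous state.
        return [
            math.tanh(
                sum(w * xi for w, xi in zip(self.w_x[j], x))
                + sum(w * hi for w, hi in zip(self.w_h[j], h))
                + self.b[j]
            )
            for j in range(self.hidden_dim)
        ]

    def encode(self, inputs: Sequence[Vector]) -> List[Vector]:
        """Return one hidden state per time window, carrying gaze history forward."""
        h = [0.0] * self.hidden_dim
        states = []
        for x in inputs:
            h = self.step(x, h)
            states.append(h)
        return states


# Toy example: 3 image regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# Fixations split into 3 time windows aligned with the description (the last has none).
windows = [[(0, 300.0)], [(1, 200.0), (2, 200.0)], []]

# Aggregated variant: one gaze-weighted vector for the whole description.
aggregated = gaze_weighted_feature(regions, [f for window in windows for f in window])
# Sequential variant: one gaze-weighted vector per window, encoded recurrently.
sequential = [gaze_weighted_feature(regions, window) for window in windows]
states = ElmanRNN(input_dim=2, hidden_dim=4).encode(sequential)
```

In this sketch the aggregated vector is identical at every step, whereas the recurrent states depend on the order in which regions were fixated. How such states would condition a caption decoder is not modelled here and would depend on the captioning system used.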

The talk and the paper were published at the EMNLP 2020 virtual conference.

Similar Papers

16/11/2020

CapWAP: Image Captioning with a Purpose

Adam Fisch, Kenton Lee, Ming-Wei Chang, Jonathan Clark, Regina Barzilay

Keywords: image task, visual images, captioning, capwap

06/12/2020

Quantifying Learnability and Describability of Visual Concepts Emerging in Representation Learning

Iro Laina, Ruth Fong, Andrea Vedaldi

Keywords: Algorithms -> Image Segmentation; Algorithms -> Semi-Supervised Learning; Applications -> Computer Vision; Applications -> Imag, Algorithms -> Adversarial Learning

16/11/2020

Learning to Represent Image and Text with Denotation Graph

Bowen Zhang, Hexiang Hu, Vihan Jain, Eugene Ie, Fei Sha

Keywords: cross-modal retrieval, referring expression, compositional recognition, pre-training

14/06/2020

Learning to Observe: Approximating Human Perceptual Thresholds for Detection of Suprathreshold Image Transformations

Alan Dolhasz, Carlo Harvey, Ian Williams

Keywords: perception, jnd, vision, deep learning, image compositing, local distortions, subjective quality

25/04/2020

Automatic Annotation Synchronizing with Textual Description for Visualization

Chufan Lai, Zhixian Lin, Ruike Jiang, Yun Han, Can Liu, Xiaoru Yuan

Keywords: visualization, annotation, natural language interface, machine learning

04/07/2020

Human Attention Maps for Text Classification: Do Humans and Neural Networks Focus on the Same Words?

Cansu Sen, Thomas Hartvigsen, Biao Yin, Xiangnan Kong, Elke Rundensteiner

Keywords: Text Classification, quantitative mechanisms, text task, large-scale study

19/04/2021

Cognition-aware cognate detection

Diptesh Kanojia, Prashant Sharma, Sayali Ghodekar, Pushpak Bhattacharyya, Gholamreza Haffari, Malhar Kulkarni

04/07/2020

Cross-modal Coherence Modeling for Caption Generation

Malihe Alikhani, Piyush Sharma, Shengjie Li, Radu Soricut, Matthew Stone

Keywords: Caption Generation, image captioning, coherence prediction, Cross-modal Modeling

06/12/2021

Multi-modal Dependency Tree for Video Captioning

Wentian Zhao, Xinxiao Wu, Jiebo Luo

Keywords: reinforcement learning and planning, graph learning, language

04/07/2020

Conversational Word Embedding for Retrieval-Based Dialog System

Wentao Ma, Yiming Cui, Ting Liu, Dong Wang, Shijin Wang, Guoping Hu

Keywords: Conversational Embedding, Retrieval-Based System, single-turn tasks, retrieval-based systems

19/04/2021

Interpretability for morphological inflection: From character-level predictions to subword-level rules

Tatyana Ruzsics, Olga Sozinova, Ximena Gutierrez-Vasques, Tanja Samardzic

02/02/2021

Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers

Shijie Geng, Peng Gao, Moitreya Chatterjee, Chiori Hori, Jonathan Le Roux, Yongfeng Zhang, Hongsheng Li, Anoop Cherian

03/08/2020

TX-Ray: Quantifying and Explaining Model-Knowledge Transfer in (Un-)Supervised NLP

Nils Rethmeier, Vageesh Kumar Saxena, Isabelle Augenstein

30/11/2020

Second Order enhanced Multi-glimpse Attention in Visual Question Answering

Qiang Sun, Binghui Xie, Yanwei Fu

26/04/2020

Neural Machine Translation with Universal Visual Representation

Zhuosheng Zhang, Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita, Zuchao Li, Hai Zhao

Keywords: Neural Machine Translation, Visual Representation, Multimodal Machine Translation, Language Representation

02/02/2021

Learning Visual Context for Group Activity Recognition

Hangjie Yuan, Dong Ni

06/12/2020

Learning Semantic-aware Normalization for Generative Adversarial Networks

Heliang Zheng, Jianlong Fu, zengyh Zeng, Jiebo Luo, Zheng-Jun Zha

04/07/2020

Multi-agent Communication meets Natural Language: Synergies between Functional and Structural Language Learning

Angeliki Lazaridou, Anna Potapenko, Olivier Tieleman

Keywords: Multi-agent Communication, natural learning, visual task, Functional Learning

06/12/2021

End-to-end Multi-modal Video Temporal Grounding

Yi-Wen Chen, Yi-Hsuan Tsai, Ming-Hsuan Yang

Keywords: self-supervised learning, transformers, vision, contrastive learning

30/11/2020

Show, Conceive and Tell: Image Captioning with Prospective Linguistic Information

Yiqing Huang, Jiansheng Chen

14/06/2020

IMRAM: Iterative Matching With Recurrent Attention Memory for Cross-Modal Image-Text Retrieval

Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, Jungong Han

Keywords: cross-modal image text retrieval, iterative matching, recurrent attention memory

30/11/2020

Image Captioning through Image Transformer

Sen He, Wentong Liao, Hamed R. Tavakoli, Michael Yang, Bodo Rosenhahn, Nicolas Pugeault

19/04/2021

L2C: Describing visual differences needs semantic understanding of individuals

An Yan, Xin Wang, Tsu-Jui Fu, William Yang Wang

06/12/2021

Looking Beyond Single Images for Contrastive Semantic Segmentation Learning

Feihu Zhang, Philip Torr, Rene Ranftl, Stephan Richter

Keywords: machine learning, vision, contrastive learning, representation learning

06/12/2020

Dynamic Fusion of Eye Movement Data and Verbal Narrations in Knowledge-rich Domains

Zhan Shaw, Qi Yu, Rui Li, Pengcheng Shi, Anne Haake

02/02/2021

A Simple and Effective Self-Supervised Contrastive Learning Framework for Aspect Detection

Tian Shi, Liuqing Li, Ping Wang, Chandan K. Reddy

08/12/2020

VICTR: Visual Information Captured Text Representation for Text-to-Vision Multimodal Tasks

Caren Han, Siqu Long, Siwen Luo, Kunze Wang, Josiah Poon

06/12/2021

Low-dimensional Structure in the Space of Language Representations is Reflected in Brain Responses

Richard Antonello, Javier S Turek, Vy Vo, Alexander Huth

Keywords: vision, language, transfer learning

08/12/2020

Incremental Neural Lexical Coherence Modeling

Sungho Jeon, Michael Strube

19/04/2021

Modelling context emotions using multi-task learning for emotion controlled dialog generation

Deeksha Varshney, Asif Ekbal, Pushpak Bhattacharyya

14/06/2020

Vision-Dialog Navigation by Exploring Cross-Modal Memory

Yi Zhu, Fengda Zhu, Zhaohuan Zhan, Bingqian Lin, Jianbin Jiao, Xiaojun Chang, Xiaodan Liang

Keywords: vision-dialog navigation, cross-modal reasoning, memory network

14/06/2020

Learning Representations by Predicting Bags of Visual Words

Spyros Gidaris, Andrei Bursuc, Nikos Komodakis, Patrick Pérez, Matthieu Cord

Keywords: representation learning, self-supervised learning, unsupervised learning, discrete representations, bag of visual words, image understanding, deep learning, convolutional neural networks

14/06/2020

ActBERT: Learning Global-Local Video-Text Representations

Linchao Zhu, Yi Yang

Keywords: actbert, cross-modal pretraining, video and language, transformer, tangled transformer, instructional videos

06/12/2020

RANet: Region Attention Network for Semantic Segmentation

Dingguo Shen, Yuanfeng Ji, Ping Li, Yi Wang, Di Lin

06/12/2021

MERLOT: Multimodal Neural Script Knowledge Models

Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi

Keywords: representation learning

19/08/2021

Text-based Person Search via Multi-Granularity Embedding Learning

Chengji Wang, Zhiming Luo, Yaojin Lin, Shaozi Li

Keywords: Computer Vision, Language and Vision, Recognition

14/06/2020

Relation-Aware Global Attention for Person Re-Identification

Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Xin Jin, Zhibo Chen

Keywords: relation-aware global attention, attention mechanism, person re-identification, feature relations, global structural information

14/06/2020

Interpretable and Accurate Fine-grained Recognition via Region Grouping

Zixuan Huang, Yin Li

Keywords: interpretable deep model, fine-grained recognition, region-based recognition

04/07/2020

Representation Learning for Information Extraction from Form-like Documents

Bodhisattwa Prasad Majumder, Navneet Potti, Sandeep Tata, James Bradley Wendt, Qi Zhao, Marc Najork

Keywords: Information Extraction, extraction task, Representation Learning, extraction system

04/07/2020

exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformer Models

Benjamin Hoover, Hendrik Strobelt, Sebastian Gehrmann

Keywords: analysis, model-internal process, exBERT, Visual Tool