Confidence-aware Non-repetitive Multimodal Transformers for TextCaps

02/02/2021

Zhaokai Wang, Renda Bao, Qi Wu, Si Liu

Abstract: When describing an image, reading the text in the visual scene is crucial to understanding the key information. Recent work explores the TextCaps task, i.e. image captioning that reads Optical Character Recognition (OCR) tokens, which requires models to read text and cover it in the generated captions. Existing approaches fail to generate accurate descriptions because of (1) poor reading ability; (2) an inability to choose the crucial words among all extracted OCR tokens; and (3) repetition of words in predicted captions. To this end, we propose Confidence-aware Non-repetitive Multimodal Transformers (CNMT) to tackle these challenges. Our CNMT consists of reading, reasoning, and generation modules. The Reading Module employs better OCR systems to enhance text reading ability and a confidence embedding to select the most noteworthy tokens. To address word redundancy in captions, our Generation Module includes a repetition mask that avoids predicting repeated words. Our model outperforms state-of-the-art models on the TextCaps dataset, improving CIDEr from 81.0 to 93.0. Our source code is publicly available.
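The repetition mask mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy scores, and the `allow_repeat` set (tokens that may legitimately recur, such as function words) are assumptions made for illustration. The idea shown is only that, at each decoding step, tokens already emitted have their scores set to negative infinity before the next token is chosen.

```python
import math

def apply_repetition_mask(scores, generated, allow_repeat=frozenset()):
    """Return a copy of `scores` where already-generated token ids get -inf.

    `allow_repeat` is a hypothetical set of token ids exempt from masking.
    """
    masked = list(scores)
    for tok in set(generated):
        if tok not in allow_repeat:
            masked[tok] = -math.inf
    return masked

def greedy_decode(step_scores, allow_repeat=frozenset()):
    """Greedy decoding over precomputed per-step scores (toy stand-in for a model)."""
    generated = []
    for scores in step_scores:
        masked = apply_repetition_mask(scores, generated, allow_repeat)
        best = max(range(len(masked)), key=masked.__getitem__)
        generated.append(best)
    return generated

# Toy scores over a 3-token vocabulary for 3 decoding steps.
step_scores = [
    [0.1, 0.9, 0.3],
    [0.2, 0.8, 0.5],
    [0.7, 0.6, 0.1],
]
print(greedy_decode(step_scores))                             # [1, 2, 0]
print(greedy_decode(step_scores, allow_repeat=frozenset({1})))  # [1, 1, 0]
```

Without the mask, token 1 would be picked twice in a row; with it, the second step falls back to the next-best token. In a real model the scores would be recomputed at each step from the decoder state rather than fixed in advance.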

The video of this talk cannot be embedded. You can watch it here:

https://slideslive.com/38947895

The talk and the corresponding paper were published at the AAAI 2021 virtual conference.

Similar Papers

16/11/2020

Form2Seq : A Framework for Higher-Order Form Structure Extraction

Milan Aggarwal, Hiresh Gupta, Mausoom Sarkar, Balaji Krishnamurthy

Keywords: document extraction, semantic task, image resolution, structure extraction

11:26

16/11/2020

Multi-resolution Annotations for Emoji Prediction

Weicheng Ma, Ruibo Liu, Lili Wang, Soroush Vosoughi

Keywords: natural tasks, emojis, linguistic components, multi-class setting

11:52

14/06/2020

ContourNet: Taking a Further Step Toward Accurate Arbitrary-Shaped Scene Text Detection

Yuxin Wang, Hongtao Xie, Zheng-Jun Zha, Mengting Xing, Zilong Fu, Yongdong Zhang

Keywords: scene text detection, arbitrary shapes, false-positive suppression, large scale variance

1:01

19/04/2021

StructSum: Summarization via structured representations

Vidhisha Balachandran, Artidoro Pagnoni, Jay Yoon Lee, Dheeraj Rajagopal, Jaime Carbonell, Yulia Tsvetkov

6:32

16/11/2020

BERT-ATTACK: Adversarial Attack Against BERT Using BERT

Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, Xipeng Qiu

Keywords: adversarial attacks, downstream tasks, calculation, gradient-based methods

11:36

02/02/2021

Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps

Qi Zhu, Chenyu Gao, Peng Wang, Qi Wu

15:58

19/08/2021

Correlation-Guided Representation for Multi-Label Text Classification

Qian-Wen Zhang, Ximing Zhang, Zhao Yan, Ruifang Liu, Yunbo Cao, Min-Ling Zhang

Keywords: Machine Learning, Multi-instance; Multi-label; Multi-view learning, Classification, Text Classification

11:13

02/02/2021

MASKER: Masked Keyword Regularization for Reliable Text Classification

Seung Jun Moon, Sangwoo Mo, Kimin Lee, Jaeho Lee, Jinwoo Shin

15:05

05/01/2021

Line Art Correlation Matching Feature Transfer Network for Automatic Animation Colorization

Qian Zhang, Bo Wang, Wei Wen, Hai Li, Junhui Liu

4:47

16/11/2020

A Simple and Effective Model for Answering Multi-span Questions

Elad Segal, Avia Efrat, Mor Shoham, Amir Globerson, Jonathan Berant

Keywords: reading comprehension, rc, learning problem, sequence problem

7:06

02/02/2021

Non-Autoregressive Coarse-to-Fine Video Captioning

Bang Yang, Yuexian Zou, Fenglin Liu, Can Zhang

18:21

12/07/2020

On Variational Learning of Controllable Representations for Text without Supervision

Peng Xu, Jackie Chi Kit Cheung, Yanshuai Cao

Keywords: Representation Learning

14:51

14/06/2020

Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text

Difei Gao, Ke Li, Ruiping Wang, Shiguang Shan, Xilin Chen

Keywords: visual question answering, graph neural network, scene text understanding, vision and language learning, multi-modal information

1:01

04/07/2020

FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization

Esin Durmus, He He, Mona Diab

Keywords: Faithfulness Assessment, Abstractive Summarization, evaluating summary, reading comprehension

12:13

16/11/2020

Simultaneous Machine Translation with Visual Context

Ozan Caglayan, Julia Ive, Veneta Haralampieva, Pranava Madhyastha, Loïc Barrault, Lucia Specia

Keywords: simt, multimodal approaches, simt frameworks, visually-grounded models

12:34

19/08/2021

Proposal-free One-stage Referring Expression via Grid-Word Cross-Attention

Wei Suo, MengYang Sun, Peng Wang, Qi Wu

Keywords: Computer Vision, Language and Vision, Structural and Model-Based Approaches, Knowledge Representation and Reasoning

17:31

06/12/2021

BARTScore: Evaluating Generated Text as Text Generation

Weizhe Yuan, Graham Neubig, Pengfei Liu

13:47

14/06/2020

Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA

Ronghang Hu, Amanpreet Singh, Trevor Darrell, Marcus Rohrbach

Keywords: textvqa, visual question answering, vqa, vision and language, st-vqa, ocr-vqa, transformer, pointer network, ocr

4:56

01/07/2020

Neural Multi-task Text Normalization and Sanitization with Pointer-Generator

Hoang Nguyen, Sandro Cavallari

9:16

16/11/2020

Plan ahead: Self-Supervised Text Planning for Paragraph Completion Task

Dongyeop Kang, Eduard Hovy

Keywords: nlp tasks, guiding realization, paragraph task, content prediction

12:13

22/11/2021

Spatial Aggregation for Scene Text Recognition

Yili Huang, Chengyu Gu, Shilin Wang, Zheng Huang, Kai Chen

Keywords: Scene text recognition, Vocabulary reliance, Spatial aggregation

2:56

02/02/2021

Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network

Yehao Li, Yingwei Pan, Ting Yao, Jingwen Chen, Tao Mei

15:34

02/02/2021

A Bidirectional Multi-paragraph Reading Model for Zero-shot Entity Linking

Hongyin Tang, Xingwu Sun, Beihong Jin, Fuzheng Zhang

14:37

16/11/2020

Multilevel Text Alignment with Cross-Document Attention

Xuhui Zhou, Nikolaos Pappas, Noah A. Smith

Keywords: text alignment, citation recommendation, plagiarism detection, predicting relationships

11:45

06/12/2021

A Multi-Implicit Neural Representation for Fonts

Pradyumna Reddy, Zhifei Zhang, Matthew Fisher, Hailin Jin, Zhaowen Wang, Niloy Mitra

Keywords: deep learning, representation learning

8:42

19/04/2021

Removing word-level spurious alignment between images and pseudo-captions in unsupervised image captioning

Ukyo Honda, Yoshitaka Ushiku, Atsushi Hashimoto, Taro Watanabe, Yuji Matsumoto

12:30

16/11/2020

Topic Modeling in Embedding Spaces

Adji Bousso Dieng, Francisco Ruiz, David Blei

Keywords: generative documents, topic modeling, topic models, embedded model

12:46

04/07/2020

Words Aren't Enough, Their Order Matters: On the Robustness of Grounding Visual Referring Expressions

Arjun Akula, Spandana Gella, Yaser Al-Onaizan, Song-Chun Zhu, Siva Reddy

Keywords: Robustness Expressions, Grounding Expressions, Visual recognition, natural understanding

6:53

04/07/2020

Enabling Language Models to Fill in the Blanks

Chris Donahue, Mina Lee, Percy Liang

Keywords: text infilling, predicting text, writing tools, language modeling

7:01

04/07/2020

LINSPECTOR: Multilingual Probing Tasks for Word Representations

Gözde Gül Sahin, Clara Vania, Ilia Kuznetsov, Iryna Gurevych

Keywords: Word Representations, NLP, classification tasks, probing tasks

11:51

04/07/2020

Text Classification with Negative Supervision

Sora Ohashi, Junya Takayama, Tomoyuki Kajiwara, Chenhui Chu, Yuki Arase

Keywords: Text Classification, text representation, text tasks, single- classifications

6:27

06/12/2020

Prophet Attention: Predicting Attention with Future Attention

Fenglin Liu, Xuancheng Ren, Xian Wu, Shen Ge, Wei Fan, Yuexian Zou, Xu Sun

3:25

19/08/2021

Deep Unified Cross-Modality Hashing by Pairwise Data Alignment

Yimu Wang, Bo Xue, Quan Cheng, Yuhui Chen, Lijun Zhang

Keywords: Computer Vision, Recognition, Information Retrieval, Deep Learning

13:11

26/04/2020

Plug and Play Language Models: A Simple Approach to Controlled Text Generation

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, Rosanne Liu

Keywords: controlled text generation, generative models, conditional generative models, language modeling, transformer

4:58

08/12/2020

Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering

Wei Han, Hantao Huang, Tao Han

9:44

04/07/2020

SPECTER: Document-level Representation Learning using Citation-informed Transformers

Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, Daniel Weld

Keywords: Document-level Learning, Representation learning, natural systems, classification

13:07

19/04/2021

Handling out-of-vocabulary problem in hangeul word embeddings

Ohjoon Kwon, Dohyun Kim, Soo-Ryeon Lee, Junyoung Choi, SangKeun Lee

8:54

04/07/2020

Unsupervised Multimodal Neural Machine Translation with Pseudo Visual Pivoting

Po-Yao Huang, Junjie Hu, Xiaojun Chang, Alexander Hauptmann

Keywords: Unsupervised Translation, Unsupervised MT, MT, alignment

12:17

12/07/2020

Educating Text Autoencoders: Latent Representation Guidance via Denoising

Tianxiao Shen, Jonas Mueller, Regina Barzilay, Tommi Jaakkola

Keywords: Deep Learning - Generative Models and Autoencoders

17:06

04/07/2020

Recurrent Chunking Mechanisms for Long-Text Machine Reading Comprehension

Hongyu Gong, Yelong Shen, Dian Yu, Jianshu Chen, Dong Yu

Keywords: Long-Text Comprehension, machine comprehension, MRC, question answering

11:25