08/12/2020

Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering

Wei Han, Hantao Huang, Tao Han

Keywords:

Abstract: Image text carries essential information for understanding a scene and performing reasoning. The text-based visual question answering (text VQA) task focuses on visual questions that require reading text in images. Existing text VQA systems generate an answer by selecting from optical character recognition (OCR) texts or a fixed vocabulary. However, the positional information of the text is underused, and no evidence is provided for the generated answer. To address this, the paper proposes a localization-aware answer prediction network (LaAP-Net). LaAP-Net not only generates an answer to the question but also predicts a bounding box as evidence for that answer. Moreover, a context-enriched OCR representation (COR) for multimodal fusion is proposed to facilitate the localization task. The proposed LaAP-Net outperforms existing approaches on three benchmark datasets for the text VQA task by a noticeable margin.
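The abstract describes a model with two outputs on top of a multimodal fusion backbone: an answer predicted from OCR tokens or a fixed vocabulary, and a bounding box predicted as evidence for that answer. The sketch below illustrates such a two-head design in PyTorch; it is not the authors' implementation, and the fusion backbone, feature dimensions, and all names (e.g. fused_ocr, pooled) are illustrative assumptions.

```python
# Minimal sketch (not the LaAP-Net implementation): one head scores candidate
# answers from OCR tokens and a fixed vocabulary, another regresses a
# bounding box as localization evidence. Dimensions and names are assumptions.
import torch
import torch.nn as nn


class AnswerAndEvidenceHeads(nn.Module):
    def __init__(self, hidden_dim: int = 768, vocab_size: int = 5000):
        super().__init__()
        # Scores each fused OCR token as a candidate answer (copy-style selection).
        self.ocr_score = nn.Linear(hidden_dim, 1)
        # Scores entries of a fixed answer vocabulary.
        self.vocab_score = nn.Linear(hidden_dim, vocab_size)
        # Regresses a normalized bounding box (x, y, w, h) as evidence.
        self.bbox_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 4),
            nn.Sigmoid(),  # keep coordinates in [0, 1]
        )

    def forward(self, fused_ocr: torch.Tensor, pooled: torch.Tensor):
        # fused_ocr: (batch, num_ocr_tokens, hidden_dim) fused OCR-token features
        # pooled:    (batch, hidden_dim) pooled question/image representation
        ocr_logits = self.ocr_score(fused_ocr).squeeze(-1)  # (batch, num_ocr_tokens)
        vocab_logits = self.vocab_score(pooled)             # (batch, vocab_size)
        answer_logits = torch.cat([vocab_logits, ocr_logits], dim=-1)
        evidence_box = self.bbox_head(pooled)               # (batch, 4)
        return answer_logits, evidence_box


if __name__ == "__main__":
    heads = AnswerAndEvidenceHeads()
    fused_ocr = torch.randn(2, 50, 768)  # dummy fused OCR features
    pooled = torch.randn(2, 768)         # dummy pooled multimodal feature
    logits, box = heads(fused_ocr, pooled)
    print(logits.shape, box.shape)       # (2, 5050) and (2, 4)
```

In this sketch the answer is chosen by the highest logit over the concatenated vocabulary and OCR-token scores, while the box head supplies the localization evidence the abstract refers to; the actual LaAP-Net fusion and prediction details are given in the paper.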

The video of this talk cannot be embedded. You can watch it here:
https://underline.io/lecture/6212-finding-the-evidence-localization-aware-answer-prediction-for-text-visual-question-answering
The talk and the corresponding paper are published at the COLING 2020 virtual conference.
