VD-BERT: A Unified Vision and Dialog Transformer with BERT

16/11/2020

Yue Wang, Shafiq Joty, Michael Lyu, Irwin King, Caiming Xiong, Steven C.H. Hoi

Keywords: visual dialog, vision-language task, visual tasks, answer ranking

Abstract: Visual dialog is a challenging vision-language task, where a dialog agent needs to answer a series of questions through reasoning on the image content and dialog history. Prior work has mostly focused on various attention mechanisms to model such intricate interactions. By contrast, in this work, we propose VD-BERT, a simple yet effective framework of unified vision-dialog Transformer that leverages the pretrained BERT language models for Visual Dialog tasks. The model is unified in that (1) it captures all the interactions between the image and the multi-turn dialog using a single-stream Transformer encoder, and (2) it supports both answer ranking and answer generation seamlessly through the same architecture. More crucially, we adapt BERT for the effective fusion of vision and dialog contents via visually grounded training. Without the need for pretraining on external vision-language data, our model yields a new state of the art, achieving the top position in both single-model and ensemble settings (74.54 and 75.35 NDCG scores) on the visual dialog leaderboard. Our code and pretrained models are released at https://github.com/salesforce/VD-BERT.
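
The NDCG scores quoted in the abstract measure how well a model ranks candidate answers against graded relevance labels. The following is a minimal sketch of the textbook NDCG formula, not the leaderboard's evaluation code: the function names `dcg` and `ndcg`, the optional cutoff `k`, and the toy scores are all assumptions made for illustration, and the abstract does not say which cutoff the leaderboard uses.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance values given in ranked order."""
    # The item at 0-based position `rank` is discounted by log2(rank + 2).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(scores, relevances, k=None):
    """NDCG@k for one question, given a model score and a relevance per candidate."""
    if k is None:
        k = len(scores)
    # Reorder the relevance labels by descending model score.
    pairs = sorted(zip(scores, relevances), key=lambda p: p[0], reverse=True)
    ranked = [rel for _, rel in pairs]
    # The ideal ranking places the most relevant candidates first.
    ideal = sorted(relevances, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    return dcg(ranked[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical example: three candidate answers with graded relevance.
perfect = ndcg([0.9, 0.1, 0.5], [1.0, 0.0, 0.5])   # ranking matches relevance order
reversed_ = ndcg([0.1, 0.5, 0.9], [1.0, 0.5, 0.0])  # ranking is exactly inverted
```

A ranking that matches the relevance order scores 1.0, and any other ordering scores lower; a leaderboard figure such as 74.54 would then correspond to an average of per-question values expressed as a percentage.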

The talk and the corresponding paper were published at the EMNLP 2020 virtual conference.

Similar Papers

18/07/2021

Unifying Vision-and-Language Tasks via Text Generation

Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal

Keywords: Algorithms, Multimodal Learning

4:58

02/02/2021

UBAR: Towards Fully End-to-End Task-Oriented Dialog System with GPT-2

Yunyi Yang, Yunhao Li, Xiaojun Quan

19:38

04/07/2020

A Generate-and-Rank Framework with Semantic Type Regularization for Biomedical Concept Normalization

Dongfang Xu, Zeyu Zhang, Steven Bethard

Keywords: Biomedical Normalization, Concept normalization, Generate-and-Rank Framework, Semantic Regularization

11:36

03/05/2021

InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective

Boxin Wang, Shuohang Wang, Yu Cheng, Zhe Gan, Ruoxi Jia, Bo Li, Jingjing Liu

Keywords: adversarial training, QA, NLI, BERT, information theory, adversarial robustness

5:21

26/04/2020

VL-BERT: Pre-training of Generic Visual-Linguistic Representations

Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai

Keywords: Visual-Linguistic, Generic Representation, Pre-training

4:40

16/11/2020

ISAAQ - Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention

Jose Manuel Gomez-Perez, Raúl Ortega

Keywords: textbook answering, machine comprehension, visual answering, transformer models

11:52

06/12/2020

Leap-Of-Thought: Teaching Pre-Trained Models to Systematically Reason Over Implicit Knowledge

Alon Talmor, Oyvind Tafjord, Peter Clark, Yoav Goldberg, Jonathan Berant

3:28

04/07/2020

Multi-Agent Task-Oriented Dialog Policy Learning with Role-Aware Reward Decomposition

Ryuichi Takanobu, Runze Liang, Minlie Huang

Keywords: pretraining, Multi-Agent Learning, Role-Aware Decomposition, reinforcement learning

13:00

04/07/2020

Cross-Modality Relevance for Reasoning on Language and Vision

Chen Zheng, Quan Guo, Parisa Kordjamshidi

Keywords: Cross-Modality Relevance, Language Vision, visual answering, VQA

10:59

19/04/2021

Modeling coreference relations in visual dialog

Mingxiao Li, Marie-Francine Moens

10:33

05/12/2020

A unified framework for multilingual and code-mixed visual question answering

Deepak Gupta, Pabitra Lenka, Asif Ekbal, Pushpak Bhattacharyya

11:48

14/06/2020

Learn to Augment: Joint Data Augmentation and Network Optimization for Text Recognition

Canjie Luo, Yuanzhi Zhu, Lianwen Jin, Yongpan Wang

Keywords: data augmentation, text recognition, joint training

0:59

02/02/2021

Revisiting Mahalanobis Distance for Transformer-Based Out-of-Domain Detection

Alexander Podolskiy, Dmitry Lipin, Andrey Bout, Ekaterina Artemova, Irina Piontkovskaya

16:08

06/12/2021

Referring Transformer: A One-step Approach to Multi-task Visual Grounding

Muchen Li, Leonid Sigal

Keywords: transformers, vision

7:54

02/02/2021

Exploring Auxiliary Reasoning Tasks for Task-oriented Dialog Systems with Meta Cooperative Learning

Bowen Qin, Min Yang, Lidong Bing, Qingshan Jiang, Chengming Li, Ruifeng Xu

15:41

12/07/2020

Tuning-free Plug-and-Play Proximal Algorithm for Inverse Imaging Problems

Kaixuan Wei, Angelica I Aviles-Rivero, Jingwei Liang, Ying Fu, Carola-Bibiane Schönlieb, Hua Huang

Keywords: Deep Learning - Algorithms

11:48

02/02/2021

Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers

Shijie Geng, Peng Gao, Moitreya Chatterjee, Chiori Hori, Jonathan Le Roux, Yongfeng Zhang, Hongsheng Li, Anoop Cherian

19:36

16/11/2020

A Diagnostic Study of Explainability Techniques for Text Classification

Pepa Atanasova, Jakob Grue Simonsen, Christina Lioma, Isabelle Augenstein

Keywords: downstream tasks, machine learning, explainability techniques, diverse techniques

11:24

23/08/2020

Spectrum-guided adversarial disparity learning

Zhe Liu, Lina Yao, Lei Bai, Xianzhi Wang, Can Wang

Keywords: adversarial autoencoder, generative models, intraclass variability, activity recognition

14:30

02/02/2021

Encoding Syntactic Knowledge in Transformer Encoder for Intent Detection and Slot Filling

Jixuan Wang, Kai Wei, Martin Radfar, Weiwei Zhang, Clement Chung

19:31

14/06/2020

Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks

Fengda Zhu, Yi Zhu, Xiaojun Chang, Xiaodan Liang

Keywords: computer vision, vision language navigation, reinforcement learning

4:25

16/11/2020

Learning to Represent Image and Text with Denotation Graph

Bowen Zhang, Hexiang Hu, Vihan Jain, Eugene Ie, Fei Sha

Keywords: cross-modal retrieval, referring expression, compositional recognition, pre-training

10:59

04/07/2020

Fluent Response Generation for Conversational Question Answering

Ashutosh Baheti, Alan Ritter, Kevin Small

Keywords: Fluent Generation, Conversational Answering, Question answering, Question QA

11:29

04/07/2020

Learning Dialog Policies from Weak Demonstrations

Gabriel Gordon-Hall, Philip John Gorinski, Shay B. Cohen

Keywords: Weak Demonstrations, dialog manager, multi-domain systems, expert demonstrators

11:14

14/06/2020

GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping

Hao-Shu Fang, Chenxi Wang, Minghao Gou, Cewu Lu

Keywords: robotics, grasping, 6d pose, grasp pose, manipulation, dataset, pick and place, bin picking

1:01

16/11/2020

Towards Robotic Assembly by Predicting Robust, Precise and Task-oriented Grasps

Jialiang Zhao, Daniel Troniak, Oliver Kroemer

5:02

26/04/2020

Composing Task-Agnostic Policies with Deep Reinforcement Learning

Ahmed H. Qureshi, Jacob J. Johnson, Yuzhe Qin, Taylor Henderson, Byron Boots, Michael C. Yip

Keywords: composition, transfer learning, deep reinforcement learning

4:57

14/06/2020

Cascaded Human-Object Interaction Recognition

Tianfei Zhou, Wenguan Wang, Siyuan Qi, Haibin Ling, Jianbing Shen

Keywords: human-object interaction recognition, cascade reasoning, fine-grained relation segmentation

1:01

06/12/2020

Incorporating BERT into Parallel Sequence Decoding with Adapters

Junliang Guo, Zhirui Zhang, Linli Xu, Hao-Ran Wei, Boxing Chen, Enhong Chen

3:17

04/07/2020

Conversation Learner - A Machine Teaching Tool for Building Dialog Managers for Task-Oriented Dialog Systems

Swadheen Shukla, Lars Liden, Shahin Shayandeh, Eslam Kamal, Jinchao Li, Matt Mazzola, Thomas Park, Baolin Peng, Jianfeng Gao

Keywords: Task-Oriented Systems, dialog managers, task-oriented system

10:15

14/06/2020

Reinforced Feature Points: Optimizing Feature Detection and Description for a High-Level Task

Aritra Bhowmik, Stefan Gumhold, Carsten Rother, Eric Brachmann

Keywords: sparse features, reinforcement learning, key point detection, feature description, feature matching, relative pose estimation, ransac, essential matrix, sift, superpoint

5:01

02/02/2021

MARTA: Leveraging Human Rationales for Explainable Text Classification

Ines Arous, Ljiljana Dolamic, Jie Yang, Akansha Bhardwaj, Giuseppe Cuccu, Philippe Cudré-Mauroux

16:43

14/06/2020

Attention-Based Context Aware Reasoning for Situation Recognition

Thilini Cooray, Ngai-Man Cheung, Wei Lu

Keywords: situation recognition, visual semantic role labelling, scene understanding, vision and language, action recognition

1:00

06/12/2021

Widening the Pipeline in Human-Guided Reinforcement Learning with Explanation and Context-Aware Data Augmentation

Lin Guan, Mudit Verma, Suna (Sihang) Guo, Ruohan Zhang, Subbarao Kambhampati

Keywords: reinforcement learning and planning, machine learning

13:41

14/06/2020

Meshed-Memory Transformer for Image Captioning

Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, Rita Cucchiara

Keywords: transformer, image captioning, vision and language, fully-attentive models, mesh connectivity, memory vectors, self-attention

1:00

16/11/2020

Sub-Instruction Aware Vision-and-Language Navigation

Yicong Hong, Cristian Rodriguez, Qi Wu, Stephen Gould

Keywords: vision-and-language navigation, navigation, agent, sub-instruction modules

9:21

06/12/2021

Searching the Search Space of Vision Transformer

Minghao Chen, Kan Wu, Bolin Ni, Houwen Peng, Bei Liu, Jianlong Fu, Hongyang Chao, Haibin Ling

Keywords: deep learning, transformers, vision

10:05

19/08/2021

UIBert: Learning Generic Multimodal Representations for UI Understanding

Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, Blaise Agüera y Arcas

Keywords: Humans and AI, Human-Computer Interaction

14:42

19/04/2021

Alternating recurrent dialog model with large-scale pre-trained language models

Qingyang Wu, Yichi Zhang, Yu Li, Zhou Yu

11:29

19/10/2020

AutoADR: Automatic model design for ad relevance

Yiren Chen, Yaming Yang, Hong Sun, Yujing Wang, Yu Xu, Wei Shen, Rong Zhou, Yunhai Tong, Jing Bai, Ruofei Zhang

Keywords: neural architecture search, knowledge distillation, ad relevance

9:24