VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning

02/02/2021

Xiaowei Hu, Xi Yin, Kevin Lin, Lei Zhang, Jianfeng Gao, Lijuan Wang, Zicheng Liu

Abstract: It is highly desirable yet challenging to generate image captions that describe novel objects unseen in caption-labeled training data, a capability that is evaluated in the novel object captioning challenge (nocaps). In this challenge, no additional image-caption training data, other than COCO Captions, is allowed for model training. Thus, conventional Vision-Language Pre-training (VLP) methods cannot be applied. This paper presents VIsual VOcabulary pre-training (VIVO), which performs pre-training in the absence of caption annotations. By breaking the dependency on paired image-caption training data in VLP, VIVO can leverage large amounts of paired image-tag data to learn a visual vocabulary. This is done by pre-training a multi-layer Transformer model that learns to align image-level tags with their corresponding image region features. To address the unordered nature of image tags, VIVO uses a Hungarian matching loss with masked tag prediction to conduct pre-training. We validate the effectiveness of VIVO by fine-tuning the pre-trained model for image captioning. In addition, we perform an analysis of the visual-text alignment inferred by our model. The results show that our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects. Our single model has achieved new state-of-the-art results on nocaps and surpassed the human CIDEr score.
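The abstract's point about unordered tags can be illustrated with a small sketch. It is not the authors' implementation: the function name `matching_loss`, the toy vocabulary, and the probabilities are all hypothetical, and the optimal assignment is found by brute force over permutations rather than with the Hungarian algorithm (both yield the same minimum-cost matching; brute force is only practical for a handful of masked tags). The idea is that the loss for a set of masked tags should not depend on the order in which the ground-truth tags happen to be listed.

```python
import math
from itertools import permutations


def matching_loss(pred_probs, target_tags, eps=1e-12):
    """Permutation-invariant loss for masked tag prediction (illustrative).

    pred_probs:  one probability list over the tag vocabulary per masked position.
    target_tags: the ground-truth tag ids for those positions, in arbitrary order.
    Returns (loss, assignment), where assignment[i] is the tag matched to position i.
    """
    if len(pred_probs) != len(target_tags):
        raise ValueError("need exactly one target tag per masked position")

    best_loss, best_assignment = math.inf, None
    # Brute-force stand-in for the Hungarian algorithm: try every one-to-one
    # assignment of target tags to masked positions and keep the cheapest.
    for perm in permutations(target_tags):
        # Cost of an assignment = summed negative log-likelihood of each matched tag.
        loss = -sum(math.log(max(p[t], eps)) for p, t in zip(pred_probs, perm))
        if loss < best_loss:
            best_loss, best_assignment = loss, perm
    return best_loss, best_assignment


# Toy vocabulary (hypothetical): 0 = "dog", 1 = "frisbee", 2 = "grass".
# Two tags were masked; the model favours "frisbee" at the first masked
# position and "dog" at the second.
pred_probs = [[0.1, 0.8, 0.1],
              [0.7, 0.2, 0.1]]
loss, assignment = matching_loss(pred_probs, target_tags=[0, 1])
print(f"loss={loss:.4f}, assignment={assignment}")
```

Listing the targets as `[1, 0]` instead of `[0, 1]` gives the same loss, which is the property an order-sensitive per-position cross-entropy would lack. For larger tag sets, a polynomial-time solver such as SciPy's `linear_sum_assignment` would be the usual substitute for the permutation loop.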

The video of this talk cannot be embedded. You can watch it here:

https://slideslive.com/38948936

The talk and the corresponding paper were published at the AAAI 2021 virtual conference.

Similar Papers

05/01/2021

Reducing the Annotation Effort for Video Object Segmentation Datasets

Paul Voigtlaender, Lishu Luo, Chun Yuan, Yong Jiang, Bastian Leibe

5:01

19/08/2021

Leveraging Human Attention in Novel Object Captioning

Xianyu Chen, Ming Jiang, Qi Zhao

Keywords: Computer Vision, Language and Vision

11:17

30/11/2020

Few-Shot Zero-Shot Learning: Knowledge Transfer with Less Supervision

Nanyi Fei, Jiechao Guan, Zhiwu Lu, Yizhao Gao

7:37

14/06/2020

Reusing Discriminators for Encoding: Towards Unsupervised Image-to-Image Translation

Runfa Chen, Wenbing Huang, Binghui Huang, Fuchun Sun, Bin Fang

Keywords: nice-gan, reusing discriminators for encoding, unsupervised image-to-image translation, decoupled training, multi-scale discriminators, adversarial loss, no independent component for encoding, shared layers, residual attention, cyclegan

1:01

06/12/2021

Independent Prototype Propagation for Zero-Shot Compositionality

Frank Ruis, Gertjan Burghouts, Doina Bucur

Keywords: deep learning, machine learning, vision, graph learning

4:18

06/12/2020

Compositional Zero-Shot Learning via Fine-Grained Dense Feature Composition

Dat Huynh, Ehsan Elhamifar

Keywords: Algorithms -> Classification; Algorithms -> Meta-Learning; Applications -> Object Recognition; Algorithms -> Semi-Supervised Learning

3:24

06/12/2021

Few-Shot Segmentation via Cycle-Consistent Transformer

Gengwei Zhang, Guoliang Kang, Yi Yang, Yunchao Wei

Keywords: transformers, vision, few shot learning

11:58

06/12/2021

Techniques for Symbol Grounding with SATNet

Sever Topan, David Rolnick, Xujie Si

Keywords: deep learning

14:37

02/02/2021

Shape-Pose Ambiguity in Learning 3D Reconstruction from Images

Yunjie Wu, Zhengxing Sun, Youcheng Song, Yunhan Sun, YiJie Zhong, Jinlong Shi

15:50

14/06/2020

TubeTK: Adopting Tubes to Track Multi-Object in a One-Step Training Model

Bo Pang, Yizhuo Li, Yifan Zhang, Muchen Li, Cewu Lu

Keywords: bounding-tube, mot, one-stage, tube-nms, fcn

4:55

22/11/2021

Simpler Does It: Generating Semantic Labels with Objectness Guidance

Md Amirul Islam, Matthew Kowal, Sen Jia, Konstantinos Derpanis, Neil Bruce

Keywords: Weakly supervised segmentation, semi supervised segmentation, Pseudo-label generation, Class Activation Maps, Objectness, Saliency

3:02

02/02/2021

Token-Aware Virtual Adversarial Training in Natural Language Understanding

Linyang Li, Xipeng Qiu

12:49

14/06/2020

How to Train Your Deep Multi-Object Tracker

Yihong Xu, Aljosa Osep, Yutong Ban, Radu Horaud, Laura Leal-Taixé, Xavier Alameda-Pineda

Keywords: deep multi-object tracking, mot end-to-end training, deep hungarian net, differentiable mota, differentiable motp, motchallenge benchmark, hungarian algorithm, vision-based tracking

1:01

19/08/2021

Learning Class-Transductive Intent Representations for Zero-shot Intent Detection

Qingyi Si, Yuanxin Liu, Peng Fu, Zheng Lin, Jiangnan Li, Weiping Wang

Keywords: Natural Language Processing, Text Classification

10:03

16/11/2020

Named Entity Recognition Only from Word Embeddings

Ying Luo, Hai Zhao, Junlang Zhan

Keywords: named recognition, entity detection, type prediction, deep models

9:54

02/02/2021

A Free Lunch for Unsupervised Domain Adaptive Object Detection without Source Data

Xianfeng Li, Weijie Chen, Di Xie, Shicai Yang, Peng Yuan, Shiliang Pu, Yueting Zhuang

19:06

06/12/2020

Consistent Structural Relation Learning for Zero-Shot Segmentation

Peike Li, Yunchao Wei, Yi Yang

Keywords: Applications -> Computer Vision

3:11

02/02/2021

Task Aligned Generative Meta-learning for Zero-shot Learning

Zhe Liu, Yun Li, Lina Yao, Xianzhi Wang, Guodong Long

15:48

14/06/2020

Learning Meta Face Recognition in Unseen Domains

Jianzhu Guo, Xiangyu Zhu, Chenxu Zhao, Dong Cao, Zhen Lei, Stan Z. Li

Keywords: face recognition, meta learning, domain generalization, metric learning

5:01

06/12/2020

Pixel-Level Cycle Association: A New Perspective for Domain Adaptive Semantic Segmentation

Guoliang Kang, Yunchao Wei, Yi Yang, Yueting Zhuang, Alexander Hauptmann

3:16

22/11/2021

CamLessMonoDepth: Monocular Depth Estimation with Unknown Camera Parameters

Sai Shyam Chanduri, Igor Vozniak, Zeeshan Khan Suri

Keywords: monocular depth estimation, self-supervised learning, single-camera egomotion, camera intrinsics estimation, sub-pixel convolutions, uncertainty estimation

3:02

14/06/2020

Overcoming Classifier Imbalance for Long-Tail Object Detection With Balanced Group Softmax

Yu Li, Tao Wang, Bingyi Kang, Sheng Tang, Chunfeng Wang, Jintao Li, Jiashi Feng

Keywords: object detection, long-tail, lvis, weight norm, classifier imbalance, balanced group softmax, bags, instance segmentation

4:57

06/12/2021

MST: Masked Self-Supervised Transformer for Visual Representation

Zhaowen Li, Zhiyang Chen, Fan Yang, Wei Li, Yousong Zhu, Chaoyang Zhao, Rui Deng, Liwei Wu, Rui Zhao, Ming Tang, Jinqiao Wang

Keywords: self-supervised learning, transformers, vision, language

7:13

05/01/2021

Multimodal Prototypical Networks for Few-Shot Learning

Frederik Pahde, Mihai Puscas, Tassilo Klein, Moin Nabi

4:56

14/06/2020

Cascaded Deep Monocular 3D Human Pose Estimation With Evolutionary Training Data

Shichao Li, Lei Ke, Kevin Pratama, Yu-Wing Tai, Chi-Keung Tang, Kwang-Ting Cheng

Keywords: 3d human pose estimation, data augmentation, evolution algorithm, 2d-to-3d network, high-resolution heatmap regression, generalization

4:59

02/02/2021

Extracting Zero-shot Structured Information from Form-like Documents: Pretraining with Keys and Triggers

Rongyu Cao, Ping Luo

18:49

06/12/2021

Joint Semantic Mining for Weakly Supervised RGB-D Salient Object Detection

Jingjing Li, Wei Ji, Qi Bi, Cheng Yan, Miao Zhang, Yongri Piao, Huchuan Lu, Li Cheng

Keywords: vision

9:03

14/06/2020

Modeling the Background for Incremental Learning in Semantic Segmentation

Fabio Cermelli, Massimiliano Mancini, Samuel Rota Bulò, Elisa Ricci, Barbara Caputo

Keywords: incremental, learning, semantic, segmentation, continual, catastrophic, forgetting, scene, parsing

1:01

06/12/2020

Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning

Weili Nie, Zhiding Yu, Lei Mao, Ankit Patel, Yuke Zhu, Anima Anandkumar

3:23

02/02/2021

Precise Yet Efficient Semantic Calibration and Refinement in ConvNets for Real-time Polyp Segmentation from Colonoscopy Videos

Huisi Wu, Jiafu Zhong, Wei Wang, Zhenkun Wen, Jing Qin

17:40

06/12/2021

Learning Compact Representations of Neural Networks using DiscriminAtive Masking (DAM)

Jie Bu, Arka Daw, M. Maruf, Anuj Karpatne

Keywords: deep learning, machine learning, vision, graph learning, representation learning

13:59

16/11/2020

An Empirical Study on Large-Scale Multi-Label Text Classification Including Few and Zero-Shot Labels

Ilias Chalkidis, Manos Fergadiotis, Sotiris Kotitsas, Prodromos Malakasiotis, Nikolaos Aletras, Ion Androutsopoulos

Keywords: flat classification, hierarchical approaches, zero-shot learning, few learning

12:21

14/06/2020

ProAlignNet: Unsupervised Learning for Progressively Aligning Noisy Contours

VSR Veeravasarapu, Abhishek Goel, Deepak Mittal, Maneesh Singh

Keywords: shape alignment, label refinement, chamfer loss, unsupervised alignment, convnets

1:00

03/05/2021

BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction

Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, Shi Gu

Keywords: Second-order analysis, Mixed Precision, Post Training Quantization

4:36

02/02/2021

LREN: Low-Rank Embedded Network for Sample-Free Hyperspectral Anomaly Detection

Kai Jiang, Weiying Xie, Jie Lei, Tao Jiang, Yunsong Li

12:56

14/06/2020

GNN3DMOT: Graph Neural Network for 3D Multi-Object Tracking With 2D-3D Multi-Feature Learning

Xinshuo Weng, Yongxin Wang, Yunze Man, Kris M. Kitani

Keywords: autonomous driving, 3d multi-object tracking, graph neural networks, multi-agent state estimation, point cloud processing, multi-modal representation learning

1:01

14/06/2020

PREDICT & CLUSTER: Unsupervised Skeleton Based Action Recognition

Kun Su, Xiulong Liu, Eli Shlizerman

Keywords: unsupervised learning, human skeleton, action recognition, spatial-temporal sequence, encoder-decoder, recurrent neural network, k-nearest neighborhood, clustering actions, motion prediction, body movements

0:58

26/04/2020

Meta-Learning without Memorization

Mingzhang Yin, George Tucker, Mingyuan Zhou, Sergey Levine, Chelsea Finn

Keywords: meta-learning, memorization, regularization, overfitting, mutually-exclusive

5:09

30/11/2020

Synthesizing the Unseen for Zero-shot Object Detection

Nasir Hayat, Munawar Hayat, Shafin Rahman, Salman Khan, Syed Waqas Zamir, Fahad Shahbaz Khan

9:49

17/08/2020

Unsupervised k-modal styled content generation

Omry Sendik, Dani Lischinski, Daniel Cohen-Or

Keywords: StyleGAN, generative adversarial networks, multi-modal distributions

11:37