Step-Wise Hierarchical Alignment Network for Image-Text Matching

Abstract: Image-text matching plays a central role in bridging the semantic gap between vision and language. The key to precise visual-semantic alignment lies in capturing the fine-grained cross-modal correspondence between image and text. Most previous methods rely on single-step reasoning to discover visual-semantic interactions, and therefore lack the ability to exploit multi-level information to locate hierarchical fine-grained relevance. In contrast, we propose a step-wise hierarchical alignment network (SHAN) that decomposes image-text matching into a multi-step cross-modal reasoning process. Specifically, we first achieve local-to-local alignment at the fragment level, followed by global-to-local and global-to-global alignment at the context level, performed sequentially. This progressive alignment strategy supplies our model with more complementary and sufficient semantic clues to understand the hierarchical correlations between image and text. Experimental results on two benchmark datasets demonstrate the superiority of our proposed method.

02/02/2021
