22/11/2021

Image-Text Alignment using Adaptive Cross-attention with Transformer Encoder for Scene Graphs

Juyong Song, Sunghyun Choi

Keywords: cross-attention, multi-modal, retrieval, scene-graphs, graph neural networks, contrastive loss

Abstract: Neural image and text encoders have been proposed to align abstract image representations with symbolic text representations. Global-local and local-local information integration between the two modalities is essential for effective alignment. In this paper, we present RELation-aware Adaptive Cross-attention (RELAX), which achieves state-of-the-art performance in cross-modal retrieval tasks through several novel improvements. First, cross-attention methods integrate global-local information by computing a weighted global feature of one modality (taken as the value) for each local feature of the other modality (taken as the query). Alignments become more accurate if the global weights of the query modality are also considered; to this end, we introduce an adaptive embedding that accounts for these weights. Second, to better exploit scene graphs, which capture high-level relations between local features, we introduce transformer encoders that align textual scene graphs with visual scene graphs. Lastly, we use an NT-Xent loss that weights samples according to their importance. Extensive experiments show that our approach outperforms other state-of-the-art models.
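As a rough illustration of two of the ingredients named in the abstract, the sketch below shows (a) a generic cross-attention step in which each local feature of one modality attends over the local features of the other to form a query-specific weighted context, and (b) a plain, unweighted NT-Xent contrastive loss over pooled embeddings. This is a minimal sketch based only on the abstract: the adaptive embedding, the scene-graph transformer encoders, and the importance weighting of the paper's NT-Xent variant are not reproduced, and all function names and tensor shapes are illustrative assumptions, not the authors' code.

    import torch
    import torch.nn.functional as F

    def cross_attention(query_local, context_local, temperature=1.0):
        # query_local: (B, Nq, D) local features of one modality (e.g., words);
        # context_local: (B, Nk, D) local features of the other (e.g., regions).
        # Each query attends over the context, yielding a weighted "global"
        # feature of the context modality per query feature.
        sims = torch.einsum('bqd,bkd->bqk', query_local, context_local)
        attn = F.softmax(sims / temperature, dim=-1)                   # (B, Nq, Nk)
        attended = torch.einsum('bqk,bkd->bqd', attn, context_local)   # (B, Nq, D)
        return attended

    def nt_xent(image_emb, text_emb, tau=0.1):
        # Standard NT-Xent over a batch of pooled image/text embeddings;
        # the paper's importance-weighted variant is not shown here.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / tau                        # (B, B)
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric cross-entropy: image-to-text and text-to-image directions.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))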

Talk and paper published at the BMVC 2021 virtual conference.
