Abstract:
Deep-learning-based feature extraction combined with visual attention mechanisms has been shown to provide good results in content-based image retrieval (CBIR). Ideally, CBIR should rely on regions that contain the objects of interest appearing in the query image. However, most existing attention models simply predict the most likely region of interest based on knowledge learned from the training dataset, regardless of the content of the query image. As a result, they may attend to context outside the object of interest, especially when an image contains multiple potential objects of interest. In this paper, we propose a conditional attention model that is sensitive to the content of the input query image and can generate more accurate attention maps. We also propose a keypoint detection and description based method for training data generation; consequently, our model does not require any additional attention labels for training. The proposed attention model enables the spatial pooling feature extraction method (generalized mean pooling) to produce better image feature representations, leading to improved image retrieval performance. The proposed framework is evaluated on a series of databases, where it is shown to perform well in challenging situations.
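The abstract mentions generalized mean (GeM) pooling as the spatial pooling method. For readers unfamiliar with it, the standard GeM formula aggregates a C x H x W feature map into a C-dimensional descriptor as ((1/HW) Σ x^p)^(1/p); p = 1 reduces to average pooling and large p approaches max pooling. A minimal NumPy sketch follows (the function name and shapes are illustrative, not taken from the paper's implementation):

```python
import numpy as np

def gem_pool(feat, p=3.0, eps=1e-6):
    """Generalized mean (GeM) pooling over the spatial dims of a C x H x W map.

    p=1 recovers average pooling; as p grows, the result approaches max pooling.
    A small eps keeps the fractional power well-defined for non-positive values.
    """
    clipped = np.clip(feat, eps, None)
    return (clipped ** p).mean(axis=(1, 2)) ** (1.0 / p)

# A constant feature map pools to that constant for any p:
desc = gem_pool(np.full((2, 4, 4), 2.0))  # -> array close to [2.0, 2.0]
```

In practice, p is often treated as a learnable parameter so the network can interpolate between average- and max-like pooling per channel.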