Abstract:
Weakly supervised object localization (WSOL) locates the target object within an image using only image-level labels. Recent methods try to extend the feature activation to cover entire object regions by dropping the most discriminative parts. However, they either overextend the activation into the background or remain limited to the most discriminative parts. In this paper, we propose a novel WSOL framework that localizes the entire object to the right extent via contrastive learning. Our framework contains three key components: 1) scheduled region drop, 2) contrastive guidance, and 3) pairwise non-local block. The scheduled region drop progressively erases the most discriminative parts of the original feature at the region level. The erased feature helps the network discover less discriminative regions in the original feature. Then, our contrastive guidance pulls the foregrounds of the original and erased features closer together while pushing them away from their respective backgrounds. In this manner, the network learns to distinguish foreground from background, which spreads the activation across the object region. Finally, we use the pairwise non-local block, which provides an enhanced attention map that strengthens the spatial correlations between pixels. Our method achieves state-of-the-art performance on the CUB-200-2011 and ImageNet benchmarks in terms of Top-1 Loc, GT-Loc, and MaxBoxAccV2.
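
As a rough illustration of the scheduled region drop idea, the sketch below builds a region-level erase mask from a 2D attention map and lowers the drop threshold over training so that more regions are erased as training proceeds. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`drop_threshold`, `region_drop_mask`), the linear schedule, the block size, and the rule "erase a region when its mean attention exceeds a fraction of the peak region mean" are all hypothetical choices made for this example. Attention values are assumed to be non-negative.

```python
def drop_threshold(epoch, total_epochs, start=1.0, end=0.8):
    """Assumed linear schedule: the threshold falls from `start` to `end`,
    so progressively more regions qualify for erasing."""
    progress = min(max(epoch / max(total_epochs - 1, 1), 0.0), 1.0)
    return start + (end - start) * progress


def region_drop_mask(attention, block, threshold):
    """Return a 0/1 mask over `attention` (a list of rows).

    The map is tiled into block x block regions. A region is erased
    (mask = 0) when its mean attention exceeds threshold * peak, where
    peak is the highest region mean. This rule is an assumption.
    """
    h, w = len(attention), len(attention[0])

    # Mean attention of every region, keyed by its top-left corner.
    means = {}
    for top in range(0, h, block):
        for left in range(0, w, block):
            cells = [attention[i][j]
                     for i in range(top, min(top + block, h))
                     for j in range(left, min(left + block, w))]
            means[(top, left)] = sum(cells) / len(cells)

    peak = max(means.values())
    mask = [[1] * w for _ in range(h)]
    for (top, left), mean in means.items():
        if mean > threshold * peak:
            # Erase the whole region, not individual pixels.
            for i in range(top, min(top + block, h)):
                for j in range(left, min(left + block, w)):
                    mask[i][j] = 0
    return mask


# Toy 4x4 attention map with 2x2 regions: the top-left region is the most
# discriminative one, the bottom-left region is the second most.
attention = [
    [0.9, 0.9, 0.1, 0.1],
    [0.9, 0.9, 0.1, 0.1],
    [0.8, 0.8, 0.0, 0.0],
    [0.8, 0.8, 0.0, 0.0],
]
early_mask = region_drop_mask(attention, block=2, threshold=0.95)
late_mask = region_drop_mask(attention, block=2, threshold=0.8)
```

In this toy setting, the higher threshold erases only the top-left region, while the lower threshold also erases the bottom-left region. Multiplying a feature map element-wise by such a mask would give an erased feature of the kind the abstract describes.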