22/11/2021

Paying Attention to Varying Receptive Fields: Object Detection with Atrous Filters and Vision Transformers

Arthur Jian Shun Lam, Jun Yi Lim, Ricky Sutopo, Vishnu Monn Baskaran

Keywords: object detection, atrous convolution, vision transformers, attention mechanism

Abstract: Object detection is a critical component of computer vision, owing to its ability to identify the locations of one or more objects in an image or video. Given its importance, various approaches have been proposed to extract meaningful and representative features across different image scales. One such approach is to vary the receptive field during feature extraction. However, varying the receptive field complicates scene understanding by introducing a greater degree of unimportant semantics into the feature maps. To address this problem, we propose a novel object detection framework that unifies dilation modules (i.e., atrous convolutions) with a vision transformer (DIL-ViT). The proposed model leverages atrous convolutions to generate rich multi-scale feature maps and employs a self-attention mechanism to enrich important backbone features. Specifically, the dilation (i.e., DIL) module fuses features across varying scales from a single input feature map at a given scale. In this way, coarse semantics and fine details are incorporated into the feature maps by convolving the features at different atrous rates in a multi-branch, multi-level structure. By embedding DIL into various object detectors, we observe notable improvements across all compared evaluation metrics on the MS-COCO dataset. To further enhance the feature maps produced by the DIL module, we then apply channel-wise attention using a vision transformer (i.e., ViT). Crucially, this approach removes unnecessary semantics present in the fused multi-scale feature map. Experimental results for DIL-ViT on the MS-COCO dataset show substantial improvements across all compared evaluation metrics.
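
The sketch below illustrates the two ideas described in the abstract: a multi-branch block that convolves the same feature map with several atrous rates before fusing the results, followed by channel-wise re-weighting of the fused map. It assumes PyTorch; the dilation rates, channel counts, and the squeeze-and-excitation style gate (used here as a simple stand-in for the paper's ViT-based channel attention) are illustrative assumptions, not the authors' exact DIL-ViT layers.

```python
# Minimal sketch (PyTorch assumed) of (1) a multi-branch atrous-convolution
# block that fuses features extracted at several dilation rates and
# (2) channel-wise attention applied to the fused map. All rates and sizes
# are illustrative; this is not the authors' implementation.
import torch
import torch.nn as nn


class AtrousFusionBlock(nn.Module):
    """Convolve the same input at several atrous rates and fuse the results."""

    def __init__(self, channels, rates=(1, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=r, dilation=r, bias=False)
            for r in rates
        ])
        # 1x1 convolution merges the concatenated branches back to `channels`.
        self.fuse = nn.Conv2d(channels * len(rates), channels, kernel_size=1)

    def forward(self, x):
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.fuse(multi_scale)


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel re-weighting, standing in for the
    transformer-based channel attention described in the abstract."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                              # global context per channel
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),                                         # per-channel weights in [0, 1]
        )

    def forward(self, x):
        return x * self.gate(x)   # suppress less informative channels


if __name__ == "__main__":
    feats = torch.randn(1, 256, 32, 32)        # a single backbone feature map
    fused = AtrousFusionBlock(256)(feats)      # multi-scale feature fusion
    attended = ChannelAttention(256)(fused)    # channel-wise re-weighting
    print(attended.shape)                      # torch.Size([1, 256, 32, 32])
```

Because each 3x3 branch pads by its own dilation rate, all branches preserve the spatial resolution of the input, so the outputs can be concatenated and fused without resampling.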

The talk and the respective paper were presented at the BMVC 2021 virtual conference.
