Abstract:
Object detection is a critical component of computer vision owing to its ability to localize one or more objects in an image or video. Given its importance, various approaches have been proposed to extract meaningful and representative features across different image scales. One such approach is to vary the receptive field during feature extraction. However, varying and adjusting the receptive field complicates scene understanding by introducing a higher degree of unimportant semantics into the feature maps. To address this problem, we propose a novel object detection framework that unifies dilation modules (i.e., atrous convolutions) with a vision transformer (DIL-ViT). The proposed model leverages atrous convolutions to generate rich multi-scale feature maps and employs a self-attention mechanism to enrich important backbone features. Specifically, the dilation (i.e., DIL) module enables feature fusion across varying scales from a single input feature map of a specific scale. Through this method, we incorporate coarse semantics and fine details into the feature maps by convolving the features with different atrous rates in a multi-branch, multi-level structure. By embedding DIL into various object detectors, we observe notable improvements across all compared evaluation metrics on the MS-COCO dataset. To further enhance the feature maps produced by DIL, we then apply channel-wise attention using a vision transformer (i.e., ViT). Crucially, this step removes unnecessary semantics present in the fused multi-scale feature map. Experimental results of DIL-ViT on the MS-COCO dataset show substantial improvements across all compared evaluation metrics.
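To make the pipeline described above more concrete, the following is a minimal PyTorch sketch of the general idea: a multi-branch dilation module that convolves a single input feature map at several atrous rates and fuses the branches, followed by a channel-wise attention step that re-weights the fused map. The atrous rates, channel sizes, and the squeeze-style channel attention (standing in for the ViT-based attention) are illustrative assumptions, not the exact DIL-ViT configuration.

```python
import torch
import torch.nn as nn


class DilationModule(nn.Module):
    """Multi-branch atrous convolutions fused into one multi-scale feature map."""

    def __init__(self, channels: int, rates=(1, 2, 4)):
        super().__init__()
        # One 3x3 convolution per atrous rate; padding == rate keeps spatial size.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=r, dilation=r, bias=False)
            for r in rates
        )
        # 1x1 convolution fuses the concatenated branches back to `channels`.
        self.fuse = nn.Conv2d(channels * len(rates), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.fuse(multi_scale)


class ChannelAttention(nn.Module):
    """Channel-wise attention that re-weights the fused feature map
    (a simple stand-in for the ViT-based attention in the paper)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        weights = self.mlp(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights  # suppress less informative channels


if __name__ == "__main__":
    feats = torch.randn(2, 64, 32, 32)        # one backbone feature map
    enriched = DilationModule(64)(feats)      # multi-scale fusion via atrous rates
    refined = ChannelAttention(64)(enriched)  # channel-wise re-weighting
    print(refined.shape)                      # torch.Size([2, 64, 32, 32])
```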