22/11/2021

Rethinking local and global feature representation for semantic segmentation

Mohan Chen, Xinxuan Zhao, Bingfei Fu, Li Zhang, Xiangyang Xue

Keywords: Semantic Segmentation, Transformer

Abstract: Although fully convolutional networks (FCNs) have dominated semantic segmentation since their inception, they are inherently limited in capturing long-range structured relationships with layers of local kernels. While recent Transformer-based models have proven extremely successful in computer vision tasks by capturing global representations, they can deteriorate semantic segmentation by over-smoothing regions that contain fine details (e.g., boundaries and small objects). To this end, we propose a Dual-Stream Convolution-Transformer segmentation framework, called DSCT, which takes advantage of both convolutions and Transformers to learn a rich feature representation for semantic segmentation. Specifically, DSCT extracts high-resolution local feature information from convolution layers and a global feature representation across the Transformer layers. Moreover, a feature fusion module is plugged in to exchange information between the spatial stream and the context stream at each stage. With local and global context modeled explicitly in every layer, the two streams can be combined with a simple decoder to yield a powerful segmentation model. Extensive experiments show that our model sets a new state of the art on the Cityscapes dataset (83.31% mIoU) with only 80K training iterations and achieves appealing performance (49.27% mIoU) on ADE20K, outperforming most alternatives with a new perspective.
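
The abstract describes a two-stream design: a convolutional spatial stream for high-resolution local features, a Transformer context stream for global features, and a fusion module exchanging information between them at each stage. Below is a minimal PyTorch sketch of that idea, not the authors' implementation; the module names (FusionModule, DualStreamStage), channel sizes, and the specific fusion design are illustrative assumptions.

```python
# Sketch of a dual-stream convolution + Transformer stage with fusion.
# All design details here are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionModule(nn.Module):
    """Hypothetical bidirectional exchange between the two streams."""

    def __init__(self, channels: int):
        super().__init__()
        self.local_to_global = nn.Conv2d(channels, channels, kernel_size=1)
        self.global_to_local = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, local_feat, global_feat):
        # Upsample the coarser global map to the local resolution before mixing.
        g = F.interpolate(global_feat, size=local_feat.shape[-2:],
                          mode="bilinear", align_corners=False)
        fused_local = local_feat + self.global_to_local(g)
        # Pool the local map down to the global resolution for the reverse path.
        l = F.adaptive_avg_pool2d(local_feat, global_feat.shape[-2:])
        fused_global = global_feat + self.local_to_global(l)
        return fused_local, fused_global


class DualStreamStage(nn.Module):
    """One stage: conv block (spatial stream) + Transformer block (context stream) + fusion."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.conv_block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.transformer_block = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True)
        self.fusion = FusionModule(channels)

    def forward(self, local_feat, global_feat):
        local_feat = self.conv_block(local_feat)
        # Flatten the global stream to tokens for self-attention, then restore.
        b, c, h, w = global_feat.shape
        tokens = global_feat.flatten(2).transpose(1, 2)          # (B, HW, C)
        tokens = self.transformer_block(tokens)
        global_feat = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.fusion(local_feat, global_feat)


if __name__ == "__main__":
    stage = DualStreamStage(channels=64)
    local = torch.randn(1, 64, 64, 64)    # high-resolution spatial stream
    global_ = torch.randn(1, 64, 16, 16)  # downsampled context stream
    local, global_ = stage(local, global_)
    print(local.shape, global_.shape)
```

Stacking several such stages and feeding both outputs to a light decoder would mirror the framework the abstract outlines, though the paper itself should be consulted for the actual architecture.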

Talk and the corresponding paper were published at the BMVC 2021 virtual conference.
