Abstract:
Although fully convolutional networks (FCNs) have dominated semantic segmentation since their introduction, they are inherently limited in capturing long-range structured relationships with stacked layers of local kernels. While recent Transformer-based models have proven extremely successful in computer vision tasks by capturing global representations, they tend to deteriorate semantic segmentation by over-smoothing regions that contain fine details (e.g., boundaries and small objects). To this end, we propose a Dual-Stream Convolution-Transformer segmentation framework, called DSCT, which takes advantage of both convolutions and Transformers to learn a rich feature representation for semantic segmentation. Specifically, DSCT extracts high-resolution local features from the convolution layers and a global feature representation from the Transformer layers. Moreover, a feature fusion module is inserted at each stage to exchange information between the spatial stream and the context stream. With local and global context modeled explicitly in every layer, the two streams can be combined with a simple decoder to provide a powerful segmentation model. Extensive experiments show that our model sets a new state of the art on the Cityscapes dataset (83.31% mIoU) with only 80K training iterations and achieves appealing performance (49.27% mIoU) on ADE20K, outperforming most alternatives while offering a new perspective.
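To make the dual-stream design concrete, the following is a minimal PyTorch sketch of the idea described above: a high-resolution convolutional spatial stream, a token-based Transformer context stream, and a fusion module that exchanges information between the two at every stage before a simple decoder. The abstract does not specify layer configurations, so all module names (`FusionModule`, `DSCTSketch`), channel dimensions, stage counts, and the exact fusion mechanism here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionModule(nn.Module):
    """Hypothetical bidirectional fusion between the spatial (conv) and context (Transformer) streams."""

    def __init__(self, conv_dim, trans_dim):
        super().__init__()
        self.to_trans = nn.Linear(conv_dim, trans_dim)    # spatial map -> context tokens
        self.to_conv = nn.Conv2d(trans_dim, conv_dim, 1)  # context tokens -> spatial map

    def forward(self, conv_feat, trans_feat):
        # conv_feat: (B, C, H, W); trans_feat: (B, N, D) with N = side * side (square inputs assumed)
        B, C, H, W = conv_feat.shape
        _, N, D = trans_feat.shape
        side = int(N ** 0.5)
        # pool the convolutional map to the token resolution and inject it into the token stream
        pooled = F.adaptive_avg_pool2d(conv_feat, side)                      # (B, C, side, side)
        trans_out = trans_feat + self.to_trans(pooled.flatten(2).transpose(1, 2))
        # reshape tokens to a map, upsample, and inject them into the convolutional stream
        trans_map = trans_feat.transpose(1, 2).reshape(B, D, side, side)
        trans_map = F.interpolate(trans_map, size=(H, W), mode="bilinear", align_corners=False)
        conv_out = conv_feat + self.to_conv(trans_map)
        return conv_out, trans_out


class DSCTSketch(nn.Module):
    """Toy dual-stream segmenter: per-stage conv layer + Transformer layer + fusion, then a 1x1 decoder."""

    def __init__(self, num_classes=19, conv_dim=64, trans_dim=128, num_stages=4):
        super().__init__()
        self.stem = nn.Conv2d(3, conv_dim, 3, stride=2, padding=1)            # spatial stream at 1/2 resolution
        self.patch_embed = nn.Conv2d(3, trans_dim, kernel_size=16, stride=16)  # context tokens at 1/16 resolution
        self.conv_stages = nn.ModuleList(
            nn.Conv2d(conv_dim, conv_dim, 3, padding=1) for _ in range(num_stages))
        self.trans_stages = nn.ModuleList(
            nn.TransformerEncoderLayer(trans_dim, nhead=4, batch_first=True) for _ in range(num_stages))
        self.fusions = nn.ModuleList(
            FusionModule(conv_dim, trans_dim) for _ in range(num_stages))
        self.decoder = nn.Conv2d(conv_dim + trans_dim, num_classes, 1)

    def forward(self, x):
        conv_feat = self.stem(x)
        trans_feat = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, D)
        for conv_layer, trans_layer, fusion in zip(self.conv_stages, self.trans_stages, self.fusions):
            conv_feat = torch.relu(conv_layer(conv_feat))
            trans_feat = trans_layer(trans_feat)
            conv_feat, trans_feat = fusion(conv_feat, trans_feat)    # exchange information at every stage
        # simple decoder: upsample tokens, concatenate both streams, predict per-pixel classes
        B, N, D = trans_feat.shape
        side = int(N ** 0.5)
        H, W = conv_feat.shape[-2:]
        trans_map = trans_feat.transpose(1, 2).reshape(B, D, side, side)
        trans_map = F.interpolate(trans_map, size=(H, W), mode="bilinear", align_corners=False)
        logits = self.decoder(torch.cat([conv_feat, trans_map], dim=1))
        return F.interpolate(logits, scale_factor=2, mode="bilinear", align_corners=False)


if __name__ == "__main__":
    model = DSCTSketch(num_classes=19)
    out = model(torch.randn(1, 3, 256, 256))
    print(out.shape)  # torch.Size([1, 19, 256, 256])
```

The sketch only illustrates the data flow; the actual DSCT presumably uses deeper backbones and a more elaborate fusion and decoder than the single layers shown here.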