Variational Context-Deformable ConvNets for Indoor Scene Parsing

Abstract: Context information is critical for semantic image segmentation. In indoor scenes in particular, object scales vary widely, which makes spatial context an important factor in segmentation performance. In this paper, we propose a novel variational context-deformable (VCD) module that learns adaptive receptive fields in a structured fashion. Unlike standard ConvNets, which share a fixed-size spatial context across all pixels, the VCD module learns a deformable spatial context guided by depth information, which provides cues for identifying the real local neighborhood of each pixel. Specifically, adaptive Gaussian kernels are learned with the guidance of multimodal information. By multiplying the learned Gaussian kernel with standard convolution filters, the VCD module aggregates flexible spatial context for each pixel during convolution. The main contributions of this work are as follows: 1) a novel VCD module that exploits learnable Gaussian kernels to enable feature learning with structured adaptive context; 2) variational Bayesian probabilistic modeling for training the VCD module, which can make it continuous and more stable; 3) a perspective-aware guidance module that takes advantage of multimodal information for RGB-D segmentation. We evaluate the proposed approach on three widely used datasets, and the resulting performance improvements demonstrate its effectiveness.
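The idea of multiplying a per-pixel Gaussian kernel with a shared convolution filter can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the isotropic, unnormalised Gaussian, the single-channel input, and the names `gaussian_kernel`, `gaussian_modulated_conv`, and `sigma_map` are all assumptions made for illustration. In the VCD module the kernel parameters are learned with multimodal guidance, whereas here the per-pixel standard deviation is simply passed in.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Isotropic, unnormalised Gaussian on a size x size grid centred at 0.
    (Assumed form; the abstract does not specify the parameterisation.)"""
    r = size // 2
    ys, xs = np.mgrid[-r:r + 1, -r:r + 1]
    return np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma ** 2))

def gaussian_modulated_conv(feature, weight, sigma_map):
    """Apply one shared filter, re-weighted per pixel by a Gaussian.

    feature:   (H, W) single-channel feature map
    weight:    (k, k) shared filter, k odd
    sigma_map: (H, W) positive per-pixel standard deviations (hypothetical
               stand-in for the learned kernel parameters)
    """
    k = weight.shape[0]
    r = k // 2
    padded = np.pad(feature, r, mode="constant")  # zero padding
    height, width = feature.shape
    out = np.zeros((height, width), dtype=float)
    for i in range(height):
        for j in range(width):
            patch = padded[i:i + k, j:j + k]
            # Element-wise product of the shared filter and this pixel's Gaussian.
            modulated = weight * gaussian_kernel(k, sigma_map[i, j])
            # Cross-correlation, as in most deep learning "convolution" layers.
            out[i, j] = np.sum(patch * modulated)
    return out
```

In this sketch a small value in `sigma_map` shrinks the effective context of a pixel towards the centre tap, while a large value recovers the plain fixed-size filter, which is one way to read "flexible spatial context for each pixel".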