Abstract:
Current instance segmentation methods fall into two categories: segmentation-based methods and proposal-based methods. The former segment the image first and then cluster the pixels into instances, while the latter detect objects first and then predict a mask for each object proposal. In this work, we propose a single-stage method, named EmbedMask, that unifies the two approaches and combines their advantages, achieving strong instance segmentation performance and producing high-resolution masks at high speed. EmbedMask introduces two newly defined embeddings for mask prediction: the pixel embedding and the proposal embedding. During training, we enforce each pixel embedding to be close to its coupled proposal embedding if they belong to the same instance. During inference, a pixel is assigned to the mask of a proposal if their embeddings are similar. This mechanism brings several benefits. First, pixel-level clustering enables EmbedMask to generate high-resolution masks and avoids complicated two-stage mask prediction. Second, the proposal embedding simplifies and strengthens the clustering procedure, so our method achieves high speed and outperforms segmentation-based methods. Without bells and whistles, EmbedMask outperforms the state-of-the-art instance segmentation method Mask R-CNN on the challenging COCO dataset, obtaining more detailed masks at a higher speed.
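
The pairing of pixel and proposal embeddings described above can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the decision rule, or the loss, so the squared Euclidean distance, the fixed `threshold`, and the simple mean-distance `pull_loss` below are assumptions chosen for illustration, as are all function names; this is not the authors' implementation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_mask(pixel_embeddings, proposal_embedding, threshold=0.5):
    """Inference step (illustrative): a pixel joins the proposal's mask
    when its embedding lies within `threshold` of the proposal embedding.
    The fixed threshold is an assumption, not taken from the abstract."""
    limit = threshold ** 2
    return [
        [1 if squared_distance(p, proposal_embedding) <= limit else 0 for p in row]
        for row in pixel_embeddings
    ]


def pull_loss(pixel_embeddings, proposal_embedding, instance_mask):
    """Training signal (illustrative): mean squared distance between the
    proposal embedding and the embeddings of pixels in the same instance.
    Minimizing it pulls those pixel embeddings toward the proposal embedding."""
    dists = [
        squared_distance(p, proposal_embedding)
        for emb_row, mask_row in zip(pixel_embeddings, instance_mask)
        for p, inside in zip(emb_row, mask_row)
        if inside
    ]
    return sum(dists) / len(dists) if dists else 0.0


# Toy 2x3 "image" with 2-D pixel embeddings forming two clusters.
pixels = [
    [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0)],
    [(0.0, 0.2), (0.9, 1.1), (1.0, 0.9)],
]
proposal_a = (0.0, 0.1)  # hypothetical proposal embedding for instance A
proposal_b = (1.0, 1.0)  # hypothetical proposal embedding for instance B

mask_a = assign_mask(pixels, proposal_a)  # [[1, 1, 0], [1, 0, 0]]
mask_b = assign_mask(pixels, proposal_b)  # [[0, 0, 1], [0, 1, 1]]
loss_a = pull_loss(pixels, proposal_a, mask_a)
```

Because each proposal carries its own embedding, every mask is obtained by a single comparison per pixel rather than by an iterative clustering pass, which is consistent with the speed argument made in the abstract.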