Abstract:
Video decaptioning aims to remove subtitles from videos and repair the occluded regions. However, most recent deep-learning-based inpainting methods require masks that indicate the corrupted regions, and such masks are unavailable for subtitled input videos. Moreover, useful background information hidden beneath the subtitles may be lost when the masked areas are treated entirely as invalid, as in the common setting of inpainting methods. In addition, existing blind video decaptioning methods often suffer from incomplete subtitle removal. In this paper, we propose a generic framework for video decaptioning, which consists of a caption mask extraction network and a frame-attention-based decaptioning network. The former is trained with supervision generated by our proposed automatic annotation method and predicts the subtitle and background regions. The latter adopts an encoder-decoder architecture with skip connections. The encoder extracts the features of all input frames. Then, multiple frame attention modules aggregate these features along the spatial and temporal dimensions. Finally, the fused features are reconstructed into the target frame by the decoder. Extensive experiments demonstrate that our proposed method accurately removes subtitles from videos in real time (60+ FPS) and outperforms state-of-the-art approaches. Code is available at https://github.com/Linya-lab/Video_Decaptioning.
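To make the two-stage pipeline described above concrete, the sketch below illustrates one plausible way to compose a mask extraction network with an encoder-decoder decaptioning network that fuses multi-frame features through a simple frame-attention module. This is a minimal illustration, not the authors' implementation: all module names, channel sizes, and the attention formulation are assumptions made for clarity.

```python
# Minimal sketch (not the authors' code) of the pipeline described in the
# abstract: a mask extraction network predicts subtitle masks, then an
# encoder-decoder with frame attention and a skip connection reconstructs
# the target frame. Shapes, channels, and module designs are illustrative.
import torch
import torch.nn as nn

class MaskExtractor(nn.Module):
    """Predicts a per-pixel subtitle mask for each frame (assumed design)."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1),
        )

    def forward(self, frames):                       # frames: (B, T, 3, H, W)
        b, t, c, h, w = frames.shape
        masks = self.net(frames.view(b * t, c, h, w))
        return torch.sigmoid(masks).view(b, t, 1, h, w)

class FrameAttention(nn.Module):
    """Fuses features of all frames into the target frame via soft attention."""
    def __init__(self, ch):
        super().__init__()
        self.q = nn.Conv2d(ch, ch, 1)
        self.k = nn.Conv2d(ch, ch, 1)

    def forward(self, feats, target_idx):            # feats: (B, T, C, H, W)
        b, t, c, h, w = feats.shape
        q = self.q(feats[:, target_idx])                              # (B, C, H, W)
        k = self.k(feats.view(b * t, c, h, w)).view(b, t, c, h, w)
        # Per-pixel similarity between the target query and each frame's key.
        scores = (q.unsqueeze(1) * k).sum(2, keepdim=True) / c ** 0.5
        weights = torch.softmax(scores, dim=1)                        # over frames
        return (weights * feats).sum(1)                               # (B, C, H, W)

class Decaptioner(nn.Module):
    """Encoder-decoder with a skip connection and frame-attention fusion."""
    def __init__(self, ch=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(4, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.enc2 = nn.Sequential(nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.attn = FrameAttention(ch * 2)
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1), nn.ReLU(inplace=True))
        self.out = nn.Conv2d(ch * 2, 3, 3, padding=1)  # takes the skip-connected features

    def forward(self, frames, masks, target_idx):
        b, t, _, h, w = frames.shape
        x = torch.cat([frames, masks], dim=2).view(b * t, 4, h, w)
        f1 = self.enc1(x)                              # per-frame shallow features
        f2 = self.enc2(f1).view(b, t, -1, h // 2, w // 2)
        fused = self.attn(f2, target_idx)              # spatio-temporal fusion
        d1 = self.dec1(fused)
        skip = f1.view(b, t, -1, h, w)[:, target_idx]  # skip connection from the encoder
        return torch.tanh(self.out(torch.cat([d1, skip], dim=1)))

# Usage: restored = Decaptioner()(frames, MaskExtractor()(frames), target_idx=2)
```

In this sketch the predicted mask is concatenated with the input frames rather than zeroing them out, reflecting the abstract's point that information hidden beneath subtitles should not simply be discarded as invalid.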