Abstract:
Removing particular objects from a video and filling the resulting holes with a plausible background is a challenging and often ill-posed task. In this paper, we propose a framework that solves this difficult problem in complex, dynamic scenes by combining multi-view geometry with convolutional neural network-based approaches. Given an input video and masks of the undesired objects, we first estimate a depth map and relative camera pose for each input frame. We then fuse the estimated depths and poses into a global 3D scene reconstruction. By projecting point clouds from the reconstructed grid volume, we can fill in most of the masked regions of the original input. Finally, we use learning-based inpainting to synthesize the remaining pixels that could not be resolved by the 3D reconstruction. Compared with previous video inpainting approaches, our system generates superior qualitative results on the DAVIS 2016 and KITTI datasets, particularly in scenes where multiple large objects are removed.
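The geometric fill step described above amounts to reprojecting reconstructed 3D points into each frame's camera and copying their colors into the masked pixels. As a minimal illustrative sketch (not the paper's implementation; the pinhole model, identity-extrinsics default, and function names here are assumptions), the core projection can be written as:

```python
import numpy as np

def project_points(points_w, K, R, t, h, w):
    """Project world-space 3D points into an h x w image plane.

    points_w: (N, 3) world coordinates
    K:        (3, 3) camera intrinsics
    R, t:     world-to-camera rotation (3, 3) and translation (3,)
    Returns integer (u, v) pixel coordinates and camera-space depths
    for points that land in front of the camera and inside the image.
    """
    cam = points_w @ R.T + t            # world -> camera coordinates
    in_front = cam[:, 2] > 1e-6         # discard points behind the camera
    cam = cam[in_front]
    pix = cam @ K.T                     # pinhole projection (homogeneous)
    uv = np.round(pix[:, :2] / pix[:, 2:3]).astype(int)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & \
             (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv[inside], cam[inside, 2]

# Toy usage: a point on the optical axis at depth 1 projects to the
# principal point (cx, cy) of a camera at the world origin.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0,   0.0,  1.0]])
uv, depth = project_points(np.array([[0.0, 0.0, 1.0]]),
                           K, np.eye(3), np.zeros(3), 100, 100)
```

In a full pipeline, each projected point's color would be written into the masked pixel it lands on (keeping the nearest point per pixel, i.e. a z-buffer), and any pixels left unfilled would be passed to the learning-based inpainting stage.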