Abstract:
3D human recovery from a single RGB image, which estimates the 3D pose and shape of a human from a 2D image, is a promising topic in virtual reality, augmented reality, and computer vision. Due to the lack of depth and local information, the task remains challenging. To address these problems, this work proposes LAMNet, a three-branch network that learns attention maps from depth and parsing features for 3D human recovery. The first branch explicitly leverages depth and pose cues to learn a depth attention map, which alleviates the recovery error between 3D space and the 2D plane. The second branch explicitly leverages parsing cues as local information about the human body, supplementing the local and edge details of the 3D recovery. The last branch is the main branch, which is responsible for estimating the 3D pose and shape of the human. Inspired by the attention mechanism, we design an attention-aware fusion module to integrate depth, parsing, and global image cues, which effectively improves the precision of 3D recovery, especially in fine details and across different viewpoints. Extensive experimental results demonstrate that our proposed approach significantly outperforms most state-of-the-art methods on the popular Human3.6M, UP-3D, and 3DPW datasets.