Abstract:
Recent studies show that end-to-end learning with well-designed lifting networks that take only 2D joint locations as input can achieve impressive performance on 3D human pose estimation. However, from the viewpoint of optimization design, existing methods in this category have two drawbacks: (1) the inherent feature relation between the 2D pose input and the corresponding 3D pose estimate is not sufficiently explored; (2) the regression is usually performed in a single step. To address these two issues, this paper proposes an efficient yet accurate method called Explicit Residual Descent (ERD). Given an arbitrary lifting network that takes the 2D joint locations in a single image as input and generates an initial 3D pose estimate, ERD learns a sequence of descent directions encoded with a shared lightweight differentiable structure. It progressively refines the previous 3D pose estimate by adding a 3D increment obtained by projecting the reconstructed 2D pose features onto each learnt descent direction. Extensive experiments on public benchmarks, including the Human3.6M dataset, validate the superior performance of the proposed method against state-of-the-art methods. Code will be made publicly available.
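
The progressive refinement described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several pieces are assumptions made only for illustration: the function names `reconstruct_2d_features` and `refine_pose` are hypothetical, the "reconstructed 2D pose features" are taken to be the input 2D joints stacked with their residual to an orthographic projection of the current 3D estimate, and each descent direction is a plain matrix plus bias rather than the shared lightweight differentiable structure the abstract mentions.

```python
import numpy as np


def reconstruct_2d_features(pose3d, pose2d):
    """Hypothetical feature map (an assumption, not taken from the paper).

    Stacks the input 2D joints with their residual to an orthographic
    projection (x, y kept, z dropped) of the current 3D estimate.
    pose3d: (J, 3), pose2d: (J, 2) -> features: (4J,)
    """
    projected = pose3d[:, :2]
    residual = pose2d - projected
    return np.concatenate([pose2d.ravel(), residual.ravel()])


def refine_pose(initial_pose3d, pose2d, directions, biases):
    """Refine an initial 3D pose with a sequence of descent steps.

    directions: list of (3J, 4J) matrices standing in for learnt descent
                directions (assumed form).
    biases:     list of (3J,) vectors, one per step (assumed form).
    """
    pose = initial_pose3d.copy()
    num_joints = pose.shape[0]
    for direction, bias in zip(directions, biases):
        features = reconstruct_2d_features(pose, pose2d)
        # Project the features onto the descent direction to get a 3D increment.
        increment = direction @ features + bias
        # Add the increment to the previous estimate.
        pose = pose + increment.reshape(num_joints, 3)
    return pose
```

Here `initial_pose3d` would come from whichever lifting network is used, and the matrices in `directions` would be learnt from training data; in this sketch both are simply passed in by the caller.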