We introduce a simple and effective network architecture for monocular 3D hand pose estimation, consisting of an image encoder followed by a mesh-convolutional decoder trained through a direct 3D hand mesh reconstruction loss. We train our network on a large-scale dataset of hand actions gathered from YouTube videos, using it as a source of weak supervision. Our system largely outperforms state-of-the-art methods, even halving the errors on the in-the-wild benchmark.
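The pipeline above (image encoder, mesh-convolutional decoder, direct vertex-wise reconstruction loss) can be sketched in a few lines. This is a minimal toy illustration, not the paper's actual architecture: the sizes, the ring-mesh connectivity, the single-layer "encoder", and the neighbour-averaging "mesh convolution" are all simplifying assumptions made here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only (not the paper's configuration).
LATENT, N_VERTS, C, H = 64, 16, 3, 32

def encode(image, W_enc):
    """Toy 'image encoder': global-average-pool the image, then a linear map."""
    feat = image.mean(axis=(0, 1))             # (C,) pooled colour features
    return W_enc @ feat                        # (LATENT,) latent code

def mesh_conv(verts_feat, neighbors, W):
    """One mesh-convolution step: mix each vertex with its neighbours."""
    agg = np.stack([verts_feat[n].mean(axis=0) for n in neighbors])
    return np.tanh(np.concatenate([verts_feat, agg], axis=1) @ W)

def decode(z, neighbors, W_lift, W_conv, W_out):
    """Toy mesh decoder: broadcast the latent to vertices, convolve, regress 3D."""
    feat = np.tile(z, (N_VERTS, 1)) @ W_lift   # per-vertex features
    feat = mesh_conv(feat, neighbors, W_conv)
    return feat @ W_out                        # (N_VERTS, 3) vertex coordinates

# Toy ring-mesh connectivity and random weights.
neighbors = [[(i - 1) % N_VERTS, (i + 1) % N_VERTS] for i in range(N_VERTS)]
W_enc = rng.normal(size=(LATENT, C)) * 0.1
W_lift = rng.normal(size=(LATENT, H)) * 0.1
W_conv = rng.normal(size=(2 * H, H)) * 0.1
W_out = rng.normal(size=(H, 3)) * 0.1

image = rng.normal(size=(8, 8, C))             # stand-in RGB image
target = rng.normal(size=(N_VERTS, 3))         # stand-in ground-truth mesh

pred = decode(encode(image, W_enc), neighbors, W_lift, W_conv, W_out)
loss = np.abs(pred - target).mean()            # direct vertex-wise L1 mesh loss
print(pred.shape)
```

In the weakly supervised setting described above, the `target` vertices would come from mesh fits to detected hand keypoints in YouTube frames rather than from manual 3D annotation.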
I am a PhD student in the Department of Computing at Imperial College London, within the High Performance Embedded and Distributed Systems Centre for Doctoral Training. I am supervised by Prof. Michael Bronstein and Dr. Stefanos Zafeiriou, and my research is fully funded by an EPSRC scholarship. I also work as a Computer Vision Scientist at Ariel AI.
Before starting the doctoral programme, I completed a Master of Research (MRes) degree in Advanced Computing at Imperial College London with distinction, under the supervision of Dr. Stefanos Zafeiriou in the Intelligent Behaviour Understanding Group. My research project addressed dense hand pose estimation from images and statistical deformable modelling of the human hand.
Please feel free to contact me.
Monocular 3D reconstruction of deformable objects, such as human body parts, has typically been approached by predicting the parameters of heavyweight linear models. In this paper, we demonstrate an alternative solution based on encoding images into a latent non-linear representation of meshes.
The thesis presents a system that establishes, in an end-to-end manner, a correspondence between hand pixels in an RGB image and a 3D hand model. It is accompanied by a hand model capable of representing different shapes, learned from hand scans, and the first dataset of 1,500,000 hand models optimized to match the pose and shape of hands in colour images.