Towards Egocentric 3D Hand Pose Estimation in Unseen Domains

About

We present V-HPOT, a novel approach for improving the cross-domain performance of 3D hand pose estimation from egocentric images across diverse, unseen domains. State-of-the-art methods demonstrate strong performance when trained and tested within the same domain, but they struggle to generalise to new environments because of limited training data and a depth perception that overfits to specific camera intrinsics. Our method addresses this by estimating keypoint z-coordinates in a virtual camera space, normalised by focal length and image size, enabling camera-agnostic depth prediction. We further leverage this invariance to camera intrinsics to propose a self-supervised test-time optimisation strategy that refines the model's depth perception during inference. This is achieved by applying a 3D consistency loss between predicted hand poses and their scale-transformed counterparts in 3D space, allowing the model to adapt to target-domain characteristics without requiring ground-truth annotations. V-HPOT significantly improves 3D hand pose estimation in cross-domain scenarios, achieving a 71% reduction in mean pose error on the H2O dataset and a 41% reduction on the AssemblyHands dataset. Compared to state-of-the-art methods, V-HPOT outperforms all single-stage approaches across all datasets and competes closely with two-stage methods, despite needing approximately 3.5× to 14× less data.
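The two ideas in the abstract, depth predicted in an intrinsics-normalised virtual camera space and a self-supervised 3D consistency loss at test time, can be illustrated with a short sketch. This is a minimal PyTorch sketch under stated assumptions, not the paper's implementation: the normalisation z_virtual = z · img_size / focal, the model interface returning (uv, z_virtual) per keypoint, and the names from_virtual_depth, backproject and consistency_loss are all hypothetical, and the paper's actual scale transform and loss may differ.

```python
import torch
import torch.nn.functional as F

def from_virtual_depth(z_virtual, focal, img_size):
    # Undo the assumed virtual-camera normalisation: the network predicts
    # depth normalised by focal length and image size, so metric depth is
    # recovered with the *current* camera's intrinsics.
    return z_virtual * focal / img_size

def backproject(uv, z, focal, centre):
    # Pinhole back-projection of 2D keypoints (B, K, 2) with depth (B, K)
    # into camera-space 3D points (B, K, 3).
    xy = (uv - centre) * z.unsqueeze(-1) / focal
    return torch.cat([xy, z.unsqueeze(-1)], dim=-1)

def consistency_loss(model, image, focal, centre, img_size, scale=1.25):
    # Illustrative self-supervised test-time objective: the same hand, fed
    # to the network at two input scales, should yield the same metric 3D
    # pose once each prediction is undone with its own effective
    # intrinsics; the discrepancy is the adaptation signal.
    uv_a, zv_a = model(image)                       # original scale
    z_a = from_virtual_depth(zv_a, focal, img_size)
    pose_a = backproject(uv_a, z_a, focal, centre)

    h, w = image.shape[-2:]
    image_s = F.interpolate(image, size=(round(h * scale), round(w * scale)),
                            mode="bilinear", align_corners=False)
    uv_b, zv_b = model(image_s)                     # rescaled copy
    # Rescaling multiplies the effective focal length, principal point and
    # image size by the same factor, so the virtual depth is unchanged and
    # the recovered metric pose must agree with the original prediction.
    z_b = from_virtual_depth(zv_b, focal * scale, img_size * scale)
    pose_b = backproject(uv_b, z_b, focal * scale, centre * scale)

    return F.smooth_l1_loss(pose_a, pose_b)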

Wiktor Mucha, Michael Wray, Martin Kampel • 2026

Related benchmarks

Task | Dataset | Result | Rank
3D Hand Pose Estimation | H2O | MPJPE Right: 51.11 | 14
3D Hand Pose Estimation | H2O (same-domain) | MPJPE: 22.77 | 8
Egocentric 3D Hand Pose Estimation | AssemblyHands | MPJPE-RA: 92.09 | 7
Egocentric 3D Hand Pose Estimation | Epic-Kps | L2 Error: 8.45 | 7
