
EgoGrasp: World-Space Hand-Object Interaction Estimation from Egocentric Videos

About

We propose EgoGrasp, the first method to reconstruct world-space hand-object interactions (W-HOI) from egocentric monocular videos captured with dynamic cameras in the wild. Accurate W-HOI reconstruction is critical for understanding human behavior and enabling applications in embodied intelligence and virtual reality. However, existing hand-object interaction (HOI) methods are limited to single images or camera coordinates, failing to model temporal dynamics or consistent global trajectories. Some recent approaches attempt world-space hand estimation but overlook object poses and HOI constraints, and their performance degrades under the severe camera motion and frequent occlusions common in egocentric in-the-wild videos. To address these challenges, we introduce a multi-stage framework comprising a robust pre-processing pipeline built on newly developed spatial intelligence models, a whole-body HOI prior model based on decoupled diffusion models, and a multi-objective test-time optimization paradigm. Our HOI prior model is template-free and scales to multiple objects. Experiments show that our method achieves state-of-the-art performance in W-HOI reconstruction.
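The abstract's multi-objective test-time optimization can be illustrated with a minimal sketch: refine an initial trajectory estimate by minimizing a weighted sum of several objectives. The specific terms, weights, and the finite-difference optimizer below are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def total_loss(x, target, prior):
    # Weighted sum of hypothetical objectives; weights are illustrative.
    data = np.sum((x - target) ** 2)          # e.g. a reprojection-style data term
    smooth = np.sum(np.diff(x, axis=0) ** 2)  # temporal smoothness across frames
    reg = np.sum((x - prior) ** 2)            # stay close to the prior estimate
    return 1.0 * data + 0.1 * smooth + 0.01 * reg

def grad(x, target, prior, eps=1e-5):
    # Finite-difference gradient; adequate for this toy example.
    g = np.zeros_like(x)
    base = total_loss(x, target, prior)
    flat, gf = x.ravel(), g.ravel()  # views into x and g
    for i in range(flat.size):
        flat[i] += eps
        gf[i] = (total_loss(x, target, prior) - base) / eps
        flat[i] -= eps
    return g

def optimize(target, prior, steps=200, lr=0.05):
    # Plain gradient descent from the prior initialization.
    x = prior.copy()
    for _ in range(steps):
        x -= lr * grad(x, target, prior)
    return x

rng = np.random.default_rng(0)
target = np.cumsum(rng.normal(size=(8, 3)), axis=0)           # toy 8-frame trajectory
prior = target + rng.normal(scale=0.5, size=(8, 3))           # noisy initial estimate
refined = optimize(target, prior)
assert total_loss(refined, target, prior) < total_loss(prior, target, prior)
```

In the real setting the parameters would be hand/object poses and global trajectories, the data term would compare against image evidence, and the prior term would come from the diffusion-based HOI prior; the structure of the weighted multi-objective loop is the point here.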

Hongming Fu, Wenjia Wang, Xiaozhen Qiao, Shuo Yang, Zheng Liu, Bo Zhao • 2026

Related benchmarks

Task                          Dataset        Result            Rank
3D Hand Pose Estimation       H2O            --                14
Hand Pose Estimation          HOI4D (test)   G-MPJPE 48.7      7
Object 6DoF Pose Estimation   HOI4D          Local RRE 11.65   7
Object 6DoF Tracking          H2O            Local RRE 23.24   7
