
EgoGrasp: World-Space Hand-Object Interaction Estimation from Egocentric Videos

About

We propose EgoGrasp, the first method to reconstruct world-space hand-object interactions (W-HOI) from egocentric monocular videos captured with dynamic cameras in the wild. Accurate W-HOI reconstruction is critical for understanding human behavior and enabling applications in embodied intelligence and virtual reality. However, existing hand-object interaction (HOI) methods are limited to single images or camera coordinates, failing to model temporal dynamics or consistent global trajectories. Some recent approaches attempt world-space hand estimation but overlook object poses and HOI constraints, and their performance degrades under the severe camera motion and frequent occlusions common in egocentric in-the-wild videos. To address these challenges, we introduce a multi-stage framework comprising a robust pre-processing pipeline built on newly developed spatial intelligence models, a whole-body HOI prior model based on decoupled diffusion models, and a multi-objective test-time optimization paradigm. Our HOI prior model is template-free and scales to multiple objects. Experiments show that our method achieves state-of-the-art performance in W-HOI reconstruction.
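The abstract's multi-objective test-time optimization can be illustrated with a minimal sketch: refine an initial trajectory estimate by minimizing a weighted sum of several objectives. The specific terms, weights, and the finite-difference optimizer below are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def total_loss(x, target, prior):
    # Weighted sum of hypothetical objectives; weights are illustrative.
    data = np.sum((x - target) ** 2)          # e.g. a reprojection-style data term
    smooth = np.sum(np.diff(x, axis=0) ** 2)  # temporal smoothness across frames
    reg = np.sum((x - prior) ** 2)            # stay close to the prior estimate
    return 1.0 * data + 0.1 * smooth + 0.01 * reg

def grad(x, target, prior, eps=1e-5):
    # Finite-difference gradient; adequate for this toy example.
    g = np.zeros_like(x)
    base = total_loss(x, target, prior)
    flat, gf = x.ravel(), g.ravel()  # views into x and g
    for i in range(flat.size):
        flat[i] += eps
        gf[i] = (total_loss(x, target, prior) - base) / eps
        flat[i] -= eps
    return g

def optimize(target, prior, steps=200, lr=0.05):
    # Plain gradient descent from the prior initialization.
    x = prior.copy()
    for _ in range(steps):
        x -= lr * grad(x, target, prior)
    return x

rng = np.random.default_rng(0)
target = np.cumsum(rng.normal(size=(8, 3)), axis=0)           # toy 8-frame trajectory
prior = target + rng.normal(scale=0.5, size=(8, 3))           # noisy initial estimate
refined = optimize(target, prior)
assert total_loss(refined, target, prior) < total_loss(prior, target, prior)
```

In the real setting the parameters would be hand/object poses and global trajectories, the data term would compare against image evidence, and the prior term would come from the diffusion-based HOI prior; the structure of the weighted multi-objective loop is the point here.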

Hongming Fu, Wenjia Wang, Xiaozhen Qiao, Shuo Yang, Zheng Liu, Bo Zhao • 2026

Related benchmarks

Task                          Dataset        Result            Rank
3D Hand Pose Estimation       H2O            --                14
Hand Pose Estimation          HOI4D (test)   G-MPJPE 48.7      7
Object 6DoF Pose Estimation   HOI4D          Local RRE 11.65   7
Object 6DoF Tracking          H2O            Local RRE 23.24   7
