ZeroHSI: Zero-Shot 4D Human-Scene Interaction by Video Generation

About

Human-scene interaction (HSI) generation is crucial for applications in embodied AI, virtual reality, and robotics. Yet existing methods cannot synthesize interactions in unseen environments, such as in-the-wild or reconstructed scenes, because they rely on paired 3D scenes and captured human motion data for training, and such data are unavailable for unseen environments. We present ZeroHSI, a novel approach that enables zero-shot 4D human-scene interaction synthesis, eliminating the need for training on any MoCap data. Our key insight is to distill human-scene interactions from state-of-the-art video generation models, which have been trained on vast amounts of natural human movements and interactions, and to use differentiable rendering to reconstruct human-scene interactions. ZeroHSI can synthesize realistic human motions in both static scenes and environments with dynamic objects, without requiring any ground-truth motion data. We evaluate ZeroHSI on a curated dataset of varied indoor and outdoor scenes with different interaction prompts, demonstrating its ability to generate diverse and contextually appropriate human-scene interactions.

Hongjie Li, Hong-Xing Yu, Jiaman Li, Jiajun Wu • 2024

Related benchmarks

Task | Dataset | Result | Rank
Human-Object Interaction Quality Evaluation | Articulated Human-Object Interaction Scenes | X-CLIP Score: 20.4 | 5
Human-Object Interaction Reconstruction | Rigid Object Interactions Monocular RGB | Foot Sliding: 0.41 | 3