Human-Object Interaction via Automatically Designed VLM-Guided Motion Policy

About

Human-object interaction (HOI) synthesis is crucial for applications in animation, simulation, and robotics. However, existing approaches either rely on expensive motion capture data or require manual reward engineering, limiting their scalability and generalizability. In this work, we introduce the first unified physics-based HOI framework that leverages Vision-Language Models (VLMs) to enable long-horizon interactions with diverse object types, including static, dynamic, and articulated objects. We introduce VLM-Guided Relative Movement Dynamics (RMD), a fine-grained spatio-temporal bipartite representation that automatically constructs goal states and reward functions for reinforcement learning. By encoding structured relationships between human and object parts, RMD enables VLMs to generate semantically grounded, interaction-aware motion guidance without manual reward tuning. To support our methodology, we present Interplay, a novel dataset with thousands of long-horizon static and dynamic interaction plans. Extensive experiments demonstrate that our framework outperforms existing methods in synthesizing natural, human-like motions across both simple single-task and complex multi-task scenarios. For more details, please refer to our project webpage: https://vlm-rmd.github.io/.
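The abstract does not spell out how RMD turns part-level relationships into rewards, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes a plan stage can be written as a set of (human part, object part) pairs, each with a target distance, and scores a pose by how closely those distances are met. The names `PartRelation` and `stage_reward`, the `target_distance` and `weight` fields, and the exponential distance kernel are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class PartRelation:
    """One edge of a bipartite graph: a human part paired with an object part.

    All fields are hypothetical; the paper's actual representation may differ.
    """
    human_part: str
    object_part: str
    target_distance: float  # desired distance between the two parts
    weight: float = 1.0     # relative importance of this edge


def stage_reward(
    relations: List[PartRelation],
    human_pos: Dict[str, Vec3],
    object_pos: Dict[str, Vec3],
    sharpness: float = 5.0,
) -> float:
    """Weighted mean of exp(-sharpness * |distance - target|) over all edges.

    Returns 1.0 when every edge meets its target distance exactly and decays
    toward 0.0 as the pose drifts away from the targets.
    """
    total_weight = sum(r.weight for r in relations)
    if total_weight == 0:
        return 0.0
    score = 0.0
    for r in relations:
        dist = math.dist(human_pos[r.human_part], object_pos[r.object_part])
        score += r.weight * math.exp(-sharpness * abs(dist - r.target_distance))
    return score / total_weight


# Hypothetical "sit" stage: pelvis on the seat, left hand on the armrest.
sit_stage = [
    PartRelation("pelvis", "seat", target_distance=0.0),
    PartRelation("left_hand", "armrest", target_distance=0.0, weight=0.5),
]
```

In a setup like this, a language model would only need to emit the list of relations for each stage of a plan, and the reward would follow mechanically from the simulated part positions. That is one way to read the abstract's claim of avoiding manual reward tuning.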

Zekai Deng, Ye Shi, Kaiyang Ji, Lan Xu, Shaoli Huang, Jingya Wang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Motion Generation | User Study | Motion Realism Preference | 80 | 10 |
| Lie | InterPlay | Completion Rate | 62 | 10 |
| Reach | InterPlay | Completion Rate | 97.5 | 10 |
| Sit | InterPlay | Completion Rate | 92.6 | 10 |
| Carry | InterPlay | Completion Rate | 88.3 | 9 |
| Open | InterPlay | Completion Rate | 91.2 | 9 |
| Push | InterPlay | Completion Rate | 84.1 | 9 |
| Long-horizon multi-task Human-Object Interaction | InterPlay Static Interaction | Completion Rate | 75.1 | 5 |
| Long-horizon multi-task Human-Object Interaction | InterPlay Dynamic Interaction | Completion Rate | 71.2 | 4 |
| Long-horizon multi-task Human-Object Interaction | InterPlay Hybrid | Completion Rate | 53.8 | 4 |
