Uni-HOI: A Unified Framework for Learning the Joint Distribution of Text and Human-Object Interaction
About
Modeling 4D human-object interaction (HOI) is a compelling challenge in computer vision and an essential technology for virtual and mixed-reality applications. While existing works have achieved promising results on specific HOI tasks, such as text-conditioned HOI generation and human motion generation from object motion, they typically rely on task-specific architectures and lack a unified framework capable of handling diverse conditional inputs. To address this gap, we propose Uni-HOI, a unified framework that learns the joint distribution among text, human motion, and object motion. By leveraging large language models (LLMs) and two motion-specific vector quantized variational autoencoders (VQ-VAEs), we convert heterogeneous motion data into token sequences compatible with LLM inputs, enabling seamless integration and joint modeling of all three modalities. We introduce a two-stage training strategy: the first stage performs multi-task learning on a large-scale HOI dataset to capture the underlying correlations among the three modalities, while the second stage fine-tunes the model on specific tasks to further enhance performance. Extensive experiments demonstrate that Uni-HOI achieves strong performance on multiple HOI-related tasks within a unified framework, including text-driven HOI generation, object motion-driven human motion generation (optionally with text), and human motion-driven object motion prediction.
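The tokenization idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the codebooks, the feature vectors standing in for encoder outputs, the `<human_i>` / `<object_i>` token naming, and the ordering of the final sequence are all assumptions made for illustration. It shows only the generic vector-quantization step (nearest codebook entry per frame) and how the resulting indices could be rendered as discrete tokens alongside text.

```python
import numpy as np

def quantize(features, codebook):
    """Return, for each feature row, the index of the nearest codebook row."""
    # Squared Euclidean distance between every feature and every code: shape (T, K)
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def to_tokens(indices, prefix):
    """Render code indices as hypothetical special-token strings, e.g. <human_3>."""
    return [f"<{prefix}_{int(i)}>" for i in indices]

# Assumption: two separate toy codebooks, one per motion modality (8 codes, dim 8).
human_codebook = np.eye(8)
object_codebook = 2.0 * np.eye(8)

# Stand-ins for per-frame encoder outputs: codebook rows plus a small offset.
human_feats = human_codebook[[2, 5, 5]] + 0.05
object_feats = object_codebook[[0, 7, 1]] + 0.05

human_tokens = to_tokens(quantize(human_feats, human_codebook), "human")
object_tokens = to_tokens(quantize(object_feats, object_codebook), "object")

# Assumption: text words and motion tokens are concatenated into one sequence.
sequence = ["pick", "up", "the", "box"] + human_tokens + object_tokens
print(sequence)
```

In a real system the codebooks would be learned by the VQ-VAEs and the token strings would be added to the LLM vocabulary; the sketch only makes the data flow concrete.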
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Human-Object Interaction Synthesis | FullBodyManipulation (test) | FID 5.13 | 19 |
| Human-Object Interaction Motion Generation | FullBodyManipulation | C_prec 86 | 12 |
| Text-driven HOI generation | BEHAVE (test) | FID 0.37 | 5 |
| human motion-driven object motion prediction | GRAB | Ec 0.024 | 2 |
| human motion-driven object motion prediction | BEHAVE | Ec 0.092 | 2 |