Uni-HOI: A Unified framework for Learning the Joint distribution of Text and Human-Object Interaction

About

Modeling 4D human-object interaction (HOI) is a compelling challenge in computer vision and an essential technology powering virtual and mixed-reality applications. While existing works have achieved promising results on specific HOI tasks, such as text-conditioned HOI generation and human motion generation from object motion, they typically rely on task-specific architectures and lack a unified framework capable of handling diverse conditional inputs. To address this limitation, we propose Uni-HOI, a unified framework that learns the joint distribution among text, human motion, and object motion. By leveraging large language models (LLMs) and two motion-specific vector quantized variational autoencoders (VQ-VAEs), we convert heterogeneous motion data into token sequences compatible with LLM inputs, enabling seamless integration and joint modeling of all three modalities. We introduce a two-stage training strategy: the first stage performs multi-task learning on a large-scale HOI dataset to capture the underlying correlations among the three modalities, while the second stage fine-tunes the model on specific tasks to further enhance performance. Extensive experiments demonstrate that, within a single framework, Uni-HOI achieves strong performance on multiple HOI-related tasks, including text-driven HOI generation, object motion-driven human motion generation (optionally with text), and human motion-driven object motion prediction.

Mengfei Zhang, Jinlu Zhang, Zhigang Tu • 2026
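
The abstract describes converting human and object motion into discrete tokens that an LLM can consume alongside text. As a rough illustration of that idea only, the sketch below quantizes toy feature vectors against two separate codebooks and shifts the resulting indices into ID ranges placed after a text vocabulary. The codebook values, the vocabulary size, the offset layout, and the function names `quantize` and `to_llm_tokens` are all assumptions made for this example; none of them come from Uni-HOI, and a real VQ-VAE would learn its codebooks and use an encoder network before quantization.

```python
# Illustrative sketch only: codebooks, vocabulary size, and token layout are
# assumptions for this example, not details taken from Uni-HOI.

def quantize(frames, codebook):
    """Map each feature vector to the index of its nearest codebook entry (squared L2)."""
    indices = []
    for frame in frames:
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, code)) for code in codebook
        ]
        indices.append(distances.index(min(distances)))
    return indices


def to_llm_tokens(indices, offset):
    """Shift codebook indices into a reserved ID range of an extended vocabulary."""
    return [offset + i for i in indices]


TEXT_VOCAB_SIZE = 1000  # hypothetical size of the text vocabulary

# Toy 2-D codebooks standing in for the two motion-specific VQ-VAE codebooks.
HUMAN_CODEBOOK = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
OBJECT_CODEBOOK = [[0.0, 1.0], [1.0, 0.0]]

# Assumed layout: human-motion tokens follow the text vocabulary,
# object-motion tokens follow the human-motion tokens.
HUMAN_OFFSET = TEXT_VOCAB_SIZE
OBJECT_OFFSET = HUMAN_OFFSET + len(HUMAN_CODEBOOK)

# Toy per-frame features (in practice these would be encoder outputs).
human_frames = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]
object_frames = [[0.2, 0.9], [0.8, 0.1], [0.9, -0.2]]

human_tokens = to_llm_tokens(quantize(human_frames, HUMAN_CODEBOOK), HUMAN_OFFSET)
object_tokens = to_llm_tokens(quantize(object_frames, OBJECT_CODEBOOK), OBJECT_OFFSET)

print(human_tokens)   # [1000, 1001, 1002]
print(object_tokens)  # [1003, 1004, 1004]
```

Keeping the two modalities in disjoint ID ranges is one simple way to let a single sequence model tell text, human-motion, and object-motion tokens apart; whether Uni-HOI uses this particular layout is not stated in the abstract.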

Related benchmarks

Task	Dataset	Result
Human-Object Interaction Synthesis	FullBodyManipulation (test)	FID 5.13	19
Human-Object Interaction Motion Generation	FullBodyManipulation	C_prec 86	12
Text-driven HOI generation	BEHAVE (test)	FID 0.37	5
Human motion-driven object motion prediction	GRAB	Ec 0.024	2
Human motion-driven object motion prediction	BEHAVE	Ec 0.092	2
