HO-Flow: Generalizable Hand-Object Interaction Generation with Latent Flow Matching

About

Generating realistic 3D hand-object interactions (HOI) is a fundamental challenge in computer vision and robotics, requiring both temporal coherence and high-fidelity physical plausibility. Existing methods remain limited in their ability to learn expressive motion representations for generation and to perform temporal reasoning. In this paper, we present HO-Flow, a framework for synthesizing realistic hand-object motion sequences from text and canonical 3D objects. HO-Flow first employs an interaction-aware variational autoencoder to encode sequences of hand and object motions into a unified latent manifold by incorporating hand and object kinematics, enabling the representation to capture rich interaction dynamics. It then leverages a masked flow matching model that combines auto-regressive temporal reasoning with continuous latent generation, improving temporal coherence. To further enhance generalization, HO-Flow predicts object motions relative to the initial frame, enabling effective pre-training on large-scale synthetic data. Experiments on the GRAB, OakInk, and DexYCB benchmarks demonstrate that HO-Flow achieves state-of-the-art performance in both physical plausibility and motion diversity for interaction motion synthesis.
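The abstract's idea of predicting object motion relative to the initial frame can be illustrated with a minimal NumPy sketch. It is not taken from the paper: the 4x4 homogeneous pose representation, the convention `inv(T_0) @ T_t`, and the helper names `relative_to_first_frame` and `make_pose` are all assumptions made for illustration.

```python
import numpy as np


def relative_to_first_frame(poses):
    """Express each object pose relative to the first frame.

    poses: array of shape (T, 4, 4), assumed to be homogeneous
    object-to-world transforms (an assumption, not the paper's format).
    Returns an array of shape (T, 4, 4) holding inv(T_0) @ T_t.
    """
    poses = np.asarray(poses, dtype=float)
    inv_first = np.linalg.inv(poses[0])
    # matmul broadcasts the (4, 4) inverse over the leading time axis.
    return inv_first @ poses


def make_pose(angle_z, translation):
    """Hypothetical helper: rotation about z plus a translation."""
    c, s = np.cos(angle_z), np.sin(angle_z)
    pose = np.eye(4)
    pose[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    pose[:3, 3] = translation
    return pose


# Toy trajectory: the object slowly rotates and slides along x.
poses = np.stack(
    [make_pose(0.1 * t, [1.0 + 0.05 * t, 2.0, 0.5]) for t in range(5)]
)
rel = relative_to_first_frame(poses)
```

Under this convention the first relative pose is the identity, so the sequence no longer depends on where the object started in the world frame. That invariance is one plausible reason such a parameterization would transfer across datasets with different coordinate setups.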

Zerui Chen, Rolandos Alexandros Potamias, Shizhe Chen, Jiankang Deng, Cordelia Schmid, Stefanos Zafeiriou • 2026

Related benchmarks

Task	Dataset	Result
Hand-object interaction motion synthesis	OakInk (out-of-distribution)	IVr Error 4.1	6
Hand-Object Interaction	GRAB (test)	IVr 5.48	6
Single-hand Manipulation Synthesis	DexYCB	IV 6.84	4
Hand-Object Interaction Synthesis	GRAB	Preference Ratio 63	2
Hand-Object Interaction Synthesis	OakInk	Preference Ratio 72	2
Hand-Object Interaction Synthesis	DexYCB	Preference Ratio 62	2

