THOM: Generating Physically Plausible Hand-Object Meshes From Text

About

Generating photorealistic 3D hand-object interactions (HOIs) from text is important for applications like robotic grasping and AR/VR content creation. In practice, however, achieving both visual fidelity and physical plausibility remains difficult, as mesh extraction from text-generated Gaussians is inherently ill-posed and the resulting meshes are often unreliable for physics-based optimization. We present THOM, a training-free framework that generates physically plausible 3D HOI meshes directly from text prompts, without requiring template object meshes. THOM follows a two-stage pipeline: it first generates hand and object Gaussians guided by text, and then refines their interaction using physics-based optimization. To enable reliable interaction modeling, we introduce a mesh extraction method with an explicit vertex-to-Gaussian mapping, which enables topology-aware regularization. We further improve physical plausibility through contact-aware optimization and vision-language model (VLM)-guided translation refinement. Extensive experiments show that THOM produces high-quality HOIs with strong text alignment, visual realism, and interaction plausibility.

Uyoung Jeong, Yihalem Yimolal Tiruneh, Hyung Jin Chang, Seungryul Baek, Kwang In Kim • 2026

Related benchmarks

Task	Dataset	Metric	Result	Rank
Text-to-3D Human-Object Interaction Generation	T3Bench 100 prompts	CLIP Score	31.4	7
Text-to-3D Human-Object Interaction Generation	100 Text-to-HOI Prompts T3Bench & GPT-4o (test)	Max Penetration Depth	2.2	3
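
For readers unfamiliar with the Max Penetration Depth metric, the sketch below shows one common way such a quantity can be computed: take the largest depth by which any hand vertex lies inside the object. It is an illustration only, not THOM's evaluation code. The sphere-shaped object, its analytic signed distance function, and the names `sphere_sdf` and `max_penetration_depth` are all assumptions made for this example; the listing above does not state the metric's exact definition or its units.

```python
import math

def sphere_sdf(point, center, radius):
    """Signed distance from a point to a sphere surface (negative inside).

    A sphere stands in for the object here purely for illustration.
    """
    return math.dist(point, center) - radius

def max_penetration_depth(hand_vertices, center, radius):
    """Largest depth by which any hand vertex lies inside the object.

    Returns 0.0 when no vertex penetrates (or when there are no vertices).
    """
    depths = [max(0.0, -sphere_sdf(v, center, radius)) for v in hand_vertices]
    return max(depths, default=0.0)

# Toy hand vertices: one outside the unit sphere, two inside it.
hand_vertices = [(0.0, 0.0, 1.5), (0.0, 0.0, 0.8), (0.5, 0.0, 0.0)]
depth = max_penetration_depth(hand_vertices, center=(0.0, 0.0, 0.0), radius=1.0)
print(depth)
```

With real meshes the analytic sphere would be replaced by a signed distance query against the object surface; lower values indicate less hand-object interpenetration.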

