LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models

About

Spatial reasoning is a fundamental aspect of human cognition, enabling intuitive understanding and manipulation of objects in three-dimensional space. While foundation models demonstrate remarkable performance on some benchmarks, they still struggle with 3D reasoning tasks like arranging objects in space according to open-ended language instructions, particularly in dense and physically constrained environments. We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models (VLMs) and supports differentiable optimization to ensure physical plausibility. LayoutVLM employs VLMs to generate two mutually reinforcing representations from visually marked images, and a self-consistent decoding process to improve VLMs' spatial planning. Our experiments show that LayoutVLM addresses the limitations of existing LLM and constraint-based approaches, producing physically plausible 3D layouts better aligned with the semantic intent of input language instructions. We also demonstrate that fine-tuning VLMs with the proposed scene layout representation extracted from existing scene datasets can improve their reasoning performance.
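To give a rough feel for what "differentiable optimization to ensure physical plausibility" can mean, the sketch below runs plain gradient descent on a toy 2D layout. It is not LayoutVLM's objective or code: objects are simplified to circles on a rectangular floor, the loss only penalizes overlaps and wall violations, and every name (`layout_loss_and_grad`, `optimize_layout`, the room size, the learning rate) is a hypothetical choice made for illustration.

```python
import math

def layout_loss_and_grad(centers, radii, room):
    """Toy loss: squared pairwise overlap plus squared wall violation.

    centers: list of [x, y]; radii: list of floats; room: (width, depth).
    Returns (loss, grads) with grads shaped like centers.
    """
    n = len(centers)
    loss = 0.0
    grads = [[0.0, 0.0] for _ in range(n)]

    # Overlap penalty for every pair: max(0, r_i + r_j - dist)^2.
    # Note: two exactly coincident centers get a zero gradient in this sketch.
    for i in range(n):
        for j in range(i + 1, n):
            dx = centers[i][0] - centers[j][0]
            dy = centers[i][1] - centers[j][1]
            dist = math.hypot(dx, dy) + 1e-9
            overlap = radii[i] + radii[j] - dist
            if overlap > 0:
                loss += overlap ** 2
                gx = -2.0 * overlap * dx / dist
                gy = -2.0 * overlap * dy / dist
                grads[i][0] += gx
                grads[i][1] += gy
                grads[j][0] -= gx
                grads[j][1] -= gy

    # Wall penalty: each circle must stay inside [0, room[axis]] on both axes.
    for i in range(n):
        for axis in (0, 1):
            low = radii[i] - centers[i][axis]
            high = centers[i][axis] + radii[i] - room[axis]
            if low > 0:
                loss += low ** 2
                grads[i][axis] -= 2.0 * low
            if high > 0:
                loss += high ** 2
                grads[i][axis] += 2.0 * high
    return loss, grads

def optimize_layout(centers, radii, room, lr=0.1, steps=200):
    """Plain gradient descent on the toy loss; returns new centers."""
    centers = [list(c) for c in centers]
    for _ in range(steps):
        _, grads = layout_loss_and_grad(centers, radii, room)
        for c, g in zip(centers, grads):
            c[0] -= lr * g[0]
            c[1] -= lr * g[1]
    return centers

# Hypothetical initial guess: two overlapping objects and one poking through a wall.
room = (4.0, 3.0)
radii = [0.5, 0.5, 0.4]
initial = [[1.0, 1.0], [1.2, 1.1], [3.9, 1.5]]

initial_loss, _ = layout_loss_and_grad(initial, radii, room)
final = optimize_layout(initial, radii, room)
final_loss, _ = layout_loss_and_grad(final, radii, room)
print(initial_loss, final_loss)
```

In this toy setup the initial positions stand in for a proposal (for instance, one produced by a language model), and the optimizer only nudges them until the physical penalties vanish. A real system would add semantic terms to the loss and handle 3D poses and orientations.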

Fan-Yun Sun, Weiyu Liu, Siyi Gu, Dylan Lim, Goutam Bhat, Federico Tombari, Manling Li, Nick Haber, Jiajun Wu • 2024

Related benchmarks

Task	Dataset	Result
3D Indoor Scene Synthesis	Bedroom (Standard Split)	CNR 44.9	13
Indoor Scene Generation	179 room-level prompts	Realism Win Rate 95.4	12
Layout Generation	Residential BIM dataset (held-out)	Mean Navigability 40.3	12
Scene Generation	Procedural Scene Generation	Collision Rate 27	12
3D Scene Synthesis	Detailed Language Instructions Average	Object Count (#Obj) 5.6	11
Scene editing	AuthorBench	All-Goal Success 26.9	10
Scene Layout Generation	11 room types	Position Error 1.79	8
3D Layout Generation	LayoutVLM benchmark	CF 81.8	8
Indoor Scene Synthesis	User Study	Visual Quality 3.26	8
3D Indoor Scene Layout Generation	3D Indoor Scene Layouts User Study	Votes 71	7

Showing 10 of 95 rows
