LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models
About
Spatial reasoning is a fundamental aspect of human cognition, enabling intuitive understanding and manipulation of objects in three-dimensional space. While foundation models demonstrate remarkable performance on some benchmarks, they still struggle with 3D reasoning tasks such as arranging objects in space according to open-ended language instructions, particularly in dense and physically constrained environments. We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models (VLMs) and supports differentiable optimization to ensure physical plausibility. LayoutVLM employs VLMs to generate two mutually reinforcing representations from visually marked images, and uses a self-consistent decoding process to improve the VLMs' spatial planning. Our experiments show that LayoutVLM addresses the limitations of existing LLM-based and constraint-based approaches, producing physically plausible 3D layouts that are better aligned with the semantic intent of the input language instructions. We also demonstrate that fine-tuning VLMs with the proposed scene layout representation, extracted from existing scene datasets, can improve their reasoning performance.
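To make the idea of differentiable layout optimization concrete, the following is a minimal sketch and not the LayoutVLM implementation. It assumes a toy 2D setting in which objects are circles in a rectangular room, a hand-written loss with an overlap penalty, a room-boundary penalty, and a "place A at a target distance from B" term standing in for a language-derived spatial relation, and plain gradient descent with analytic gradients in place of an autodiff framework. All names (`layout_loss_and_grad`, `optimize_layout`, `near_pairs`) are hypothetical.

```python
import math


def layout_loss_and_grad(pos, radii, near_pairs, room):
    """Return (loss, grad) for circular objects in a rectangular room.

    pos: list of [x, y] centers; radii: list of circle radii;
    near_pairs: list of (i, j, target_distance); room: (width, depth).
    """
    n = len(pos)
    grad = [[0.0, 0.0] for _ in range(n)]
    loss = 0.0

    # Physical term: squared penetration depth for every overlapping pair.
    for i in range(n):
        for j in range(i + 1, n):
            dx = pos[i][0] - pos[j][0]
            dy = pos[i][1] - pos[j][1]
            dist = math.hypot(dx, dy) + 1e-9  # avoid division by zero
            overlap = radii[i] + radii[j] - dist
            if overlap > 0:
                loss += overlap ** 2
                g = -2.0 * overlap / dist
                grad[i][0] += g * dx
                grad[i][1] += g * dy
                grad[j][0] -= g * dx
                grad[j][1] -= g * dy

    # Physical term: squared distance by which a circle leaves the room.
    for i in range(n):
        for axis, limit in enumerate(room):
            low = radii[i] - pos[i][axis]
            high = pos[i][axis] + radii[i] - limit
            if low > 0:
                loss += low ** 2
                grad[i][axis] -= 2.0 * low
            if high > 0:
                loss += high ** 2
                grad[i][axis] += 2.0 * high

    # Semantic term: squared error between actual and target distance.
    for i, j, target in near_pairs:
        dx = pos[i][0] - pos[j][0]
        dy = pos[i][1] - pos[j][1]
        dist = math.hypot(dx, dy) + 1e-9
        err = dist - target
        loss += err ** 2
        g = 2.0 * err / dist
        grad[i][0] += g * dx
        grad[i][1] += g * dy
        grad[j][0] -= g * dx
        grad[j][1] -= g * dy

    return loss, grad


def optimize_layout(pos, radii, near_pairs, room, lr=0.05, steps=500):
    """Run plain gradient descent on the layout loss and return new positions."""
    pos = [list(p) for p in pos]
    for _ in range(steps):
        _, grad = layout_loss_and_grad(pos, radii, near_pairs, room)
        for p, g in zip(pos, grad):
            p[0] -= lr * g[0]
            p[1] -= lr * g[1]
    return pos


# Toy scene: a "bed" and a "nightstand" that start out overlapping.
radii = [1.0, 0.3]
start = [[1.0, 1.0], [1.2, 1.0]]
near_pairs = [(0, 1, 1.4)]  # nightstand center 1.4 units from bed center
room = (4.0, 4.0)

final = optimize_layout(start, radii, near_pairs, room)
final_loss, _ = layout_loss_and_grad(final, radii, near_pairs, room)
```

In this sketch the semantic relation and the physical penalties share one scalar objective, so a single gradient step trades them off jointly; a real system would likely use 3D poses, oriented bounding boxes, and an autodiff library rather than hand-derived gradients.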
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| 3D Indoor Scene Synthesis | Bedroom (Standard Split) | CNR 44.9 | 13 |
| Indoor Scene Generation | 179 room-level prompts | Realism Win Rate 95.4 | 12 |
| Layout Generation | Residential BIM dataset (held-out) | Mean Navigability 40.3 | 12 |
| Scene Layout Generation | 11 room types | Position Error 1.79 | 8 |
| Indoor Scene Synthesis | User Study | Visual Quality 3.26 | 8 |
| 3D Indoor Scene Synthesis | Living Room (Standard Split) | OBR 14.3 | 7 |
| 3D Indoor Scene Synthesis | Avg. Bed + Living (Standard Split) | OBR 12.9 | 7 |
| Controllable Indoor Scene Synthesis | Indoor Scene Synthesis Controllability Evaluation | LF 25 | 6 |
| 3D Scene Synthesis | Detailed Language Instructions Bathroom | Object Count 6.6 | 6 |
| 3D Scene Synthesis | Detailed Language Instructions Living Room | Object Count 5.6 | 6 |