
LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models

About

Spatial reasoning is a fundamental aspect of human cognition, enabling intuitive understanding and manipulation of objects in three-dimensional space. While foundation models demonstrate remarkable performance on some benchmarks, they still struggle with 3D reasoning tasks like arranging objects in space according to open-ended language instructions, particularly in dense and physically constrained environments. We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models (VLMs) and supports differentiable optimization to ensure physical plausibility. LayoutVLM employs VLMs to generate two mutually reinforcing representations from visually marked images, and a self-consistent decoding process to improve VLMs' spatial planning. Our experiments show that LayoutVLM addresses the limitations of existing LLM and constraint-based approaches, producing physically plausible 3D layouts better aligned with the semantic intent of input language instructions. We also demonstrate that fine-tuning VLMs with the proposed scene layout representation extracted from existing scene datasets can improve their reasoning performance.

Fan-Yun Sun, Weiyu Liu, Siyi Gu, Dylan Lim, Goutam Bhat, Federico Tombari, Manling Li, Nick Haber, Jiajun Wu • 2024
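
The abstract mentions differentiable optimization as the mechanism for physical plausibility, but this page does not specify the objectives LayoutVLM uses. As a rough illustration of the general idea only, the sketch below runs plain gradient descent on 2D object positions to reduce pairwise overlap and keep objects inside a room. Everything in it is an assumption made for illustration: the circular footprints, the squared-penalty loss, and the names `layout_loss_and_grad` and `optimize` are hypothetical and are not taken from the paper or its code.

```python
import math

def layout_loss_and_grad(pos, radii, room):
    """Toy layout loss (NOT the paper's objective) with analytic gradients.

    pos   : list of [x, y] centers of circular object footprints
    radii : list of footprint radii
    room  : (width, depth) of an axis-aligned room with a corner at (0, 0)
    """
    n = len(pos)
    loss = 0.0
    grads = [[0.0, 0.0] for _ in range(n)]

    # Pairwise overlap penalty: max(0, r_i + r_j - d)^2
    for i in range(n):
        for j in range(i + 1, n):
            dx = pos[i][0] - pos[j][0]
            dy = pos[i][1] - pos[j][1]
            # Guard against division by zero; exactly coincident centers
            # receive no separating gradient in this simple sketch.
            d = math.hypot(dx, dy) or 1e-9
            pen = radii[i] + radii[j] - d
            if pen > 0:
                loss += pen * pen
                gx = -2.0 * pen * dx / d
                gy = -2.0 * pen * dy / d
                grads[i][0] += gx
                grads[i][1] += gy
                grads[j][0] -= gx
                grads[j][1] -= gy

    # Boundary penalty: squared distance by which a footprint leaves the room
    for i in range(n):
        for axis in (0, 1):
            low = radii[i] - pos[i][axis]
            high = pos[i][axis] + radii[i] - room[axis]
            if low > 0:
                loss += low * low
                grads[i][axis] -= 2.0 * low
            if high > 0:
                loss += high * high
                grads[i][axis] += 2.0 * high

    return loss, grads

def optimize(pos, radii, room, lr=0.1, steps=500):
    """Plain gradient descent on object positions; returns a new position list."""
    pos = [list(p) for p in pos]
    for _ in range(steps):
        _, grads = layout_loss_and_grad(pos, radii, room)
        for p, g in zip(pos, grads):
            p[0] -= lr * g[0]
            p[1] -= lr * g[1]
    return pos

# Hypothetical initial layout: two overlapping objects and one poking
# through the far corner of a 4 x 3 room.
start = [[1.0, 1.0], [1.2, 1.1], [3.9, 2.9]]
radii = [0.5, 0.5, 0.4]
room = (4.0, 3.0)
final = optimize(start, radii, room)
```

In a pipeline like the one the abstract describes, the starting positions and the constraints would come from the VLM rather than being hard-coded, and the loss would cover richer spatial relations than overlap and containment.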

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| 3D Indoor Scene Synthesis | Bedroom (Standard Split) | CNR | 44.9 | 13 |
| Indoor Scene Generation | 179 room-level prompts | Realism Win Rate | 95.4 | 12 |
| Layout Generation | Residential BIM dataset (held-out) | Mean Navigability | 40.3 | 12 |
| Scene Layout Generation | 11 room types | Position Error | 1.79 | 8 |
| Indoor Scene Synthesis | User Study | Visual Quality | 3.26 | 8 |
| 3D Indoor Scene Synthesis | Living Room (Standard Split) | OBR | 14.3 | 7 |
| 3D Indoor Scene Synthesis | Avg. Bed + Living (Standard Split) | OBR | 12.9 | 7 |
| Controllable Indoor Scene Synthesis | Indoor Scene Synthesis Controllability Evaluation | LF | 25 | 6 |
| 3D Scene Synthesis | Detailed Language Instructions Bathroom | Object Count | 6.6 | 6 |
| 3D Scene Synthesis | Detailed Language Instructions Living Room | Object Count | 5.6 | 6 |
Showing 10 of 34 rows
