RoomPilot: Controllable Indoor Scene Synthesis via Multimodal Semantic Parsing
About
Generating controllable indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI. However, existing approaches either support only a limited set of input modalities or rely on implicit generation processes that hinder precise control over scene structure and semantics. To address these limitations, we introduce RoomPilot, a unified framework for controllable indoor scene synthesis from multimodal inputs, including textual descriptions and CAD floor plans. RoomPilot maps heterogeneous inputs into an Indoor Domain-Specific Language (IDSL), which serves as a structured and interpretable semantic representation for describing indoor scenes. Built upon IDSL, RoomPilot presents a hierarchical synthesis pipeline that progressively organizes scenes at the building, room, and object levels, promoting structural coherence and functional consistency across multi-room layouts. Moreover, RoomPilot constructs a curated asset dataset with rich semantic annotations to support high-quality scene synthesis, improving visual realism and appearance consistency. Extensive experiments demonstrate effective multimodal understanding, fine-grained controllability in scene generation, and improved physical consistency and visual fidelity, marking a significant step toward controllable 3D indoor scene synthesis. Code and models will be made available.
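The abstract does not give the IDSL syntax, so the following is only an illustrative sketch of what a structured building/room/object representation might look like. Every name here (`Building`, `Room`, `SceneObject`, `count_objects`, the field names, and the example values) is a hypothetical assumption and not RoomPilot's actual schema or API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


# Hypothetical three-level hierarchy mirroring the building, room, and
# object levels named in the abstract. This is NOT the real IDSL schema.
@dataclass
class SceneObject:
    category: str                    # e.g. "bed" (assumed label set)
    position: tuple[float, float]    # assumed (x, y) floor coordinates


@dataclass
class Room:
    name: str
    room_type: str                   # e.g. "bedroom", "kitchen"
    objects: list[SceneObject] = field(default_factory=list)


@dataclass
class Building:
    rooms: list[Room] = field(default_factory=list)


def count_objects(building: Building) -> dict[str, int]:
    """Return the number of objects placed in each room, keyed by room name."""
    return {room.name: len(room.objects) for room in building.rooms}


def build_example() -> Building:
    """Assemble a tiny two-room scene top-down: building, then rooms, then objects."""
    bedroom = Room("bedroom_1", "bedroom")
    bedroom.objects.append(SceneObject("bed", (1.0, 2.0)))
    bedroom.objects.append(SceneObject("nightstand", (2.2, 2.0)))
    kitchen = Room("kitchen_1", "kitchen", [SceneObject("fridge", (0.5, 0.5))])
    return Building([bedroom, kitchen])
```

An explicit structure of this kind makes per-room properties such as object counts directly inspectable, which is the sort of interpretability the abstract attributes to a structured intermediate representation.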
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Indoor Scene Synthesis | Bedroom (Standard Split) | CNR | 0.00e+0 | 13 |
| 3D Scene Synthesis | Detailed Language Instructions Average | Object Count (#Obj) | 16.5 | 11 |
| Indoor Scene Synthesis | User Study | Visual Quality | 4.1 | 8 |
| 3D Scene Synthesis | Detailed Language Instructions Living Room | Object Count | 26.3 | 6 |
| 3D Scene Synthesis | Detailed Language Instructions Dining Room | # Objects | 21.2 | 6 |
| Controllable Indoor Scene Synthesis | Indoor Scene Synthesis Controllability Evaluation | LF | 58 | 6 |
| 3D Scene Synthesis | Detailed Language Instructions Kitchen | Object Count Score | 10.6 | 6 |
| 3D Scene Synthesis | Detailed Language Instructions Bathroom | Object Count | 10.2 | 6 |