RoomPilot: Controllable Synthesis of Interactive Indoor Environments via Multimodal Semantic Parsing
About
Generating controllable and interactive indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI training. Yet existing approaches either handle a narrow range of input modalities or rely on stochastic processes that hinder controllability. To overcome these limitations, we introduce RoomPilot, a unified framework that parses diverse multi-modal inputs (textual descriptions or CAD floor plans) into an Indoor Domain-Specific Language (IDSL) for structured indoor scene generation. The key insight is that a well-designed IDSL can act as a shared semantic representation, enabling coherent, high-quality scene synthesis from any single modality while maintaining interaction semantics. In contrast to conventional procedural methods that produce visually plausible but functionally inert layouts, RoomPilot leverages a curated dataset of interaction-annotated assets to synthesize environments exhibiting realistic object behaviors. Extensive experiments validate its strong multi-modal understanding, fine-grained controllability in scene generation, and superior physical consistency and visual fidelity, marking a significant step toward general-purpose controllable 3D indoor scene generation.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Indoor Scene Synthesis | Bedroom (Standard Split) | CNR | 0.00e+0 | 13 |
| Indoor Scene Synthesis | User Study | Visual Quality | 4.1 | 8 |
| 3D Scene Synthesis | Detailed Language Instructions Living Room | Object Count | 26.3 | 6 |
| 3D Scene Synthesis | Detailed Language Instructions Dining Room | # Objects | 21.2 | 6 |
| Controllable Indoor Scene Synthesis | Indoor Scene Synthesis Controllability Evaluation | LF | 58 | 6 |
| 3D Scene Synthesis | Detailed Language Instructions Kitchen | Object Count Score | 10.6 | 6 |
| 3D Scene Synthesis | Detailed Language Instructions Bathroom | Object Count | 10.2 | 6 |
| 3D Scene Synthesis | Detailed Language Instructions Average | Object Count (#Obj) | 16.5 | 6 |