InstructScene: Instruction-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior
About
Comprehending natural language instructions is a desirable property for 3D indoor scene synthesis systems. Existing methods directly model object joint distributions and express object relations implicitly within a scene, thereby hindering the controllability of generation. We introduce InstructScene, a novel generative framework that integrates a semantic graph prior and a layout decoder to improve controllability and fidelity for 3D scene synthesis. The proposed semantic graph prior jointly learns scene appearances and layout distributions, exhibiting versatility across various downstream tasks in a zero-shot manner. To facilitate benchmarking of text-driven 3D scene synthesis, we curate a high-quality dataset of scene-instruction pairs with large language and multimodal models. Extensive experimental results reveal that the proposed method surpasses existing state-of-the-art approaches by a large margin. Thorough ablation studies confirm the efficacy of crucial design components. Project page: https://chenguolin.github.io/projects/InstructScene.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-scene generation | 3D-FRONT Diningroom (test) | FID | 129.1 | 10 |
| Text-to-scene generation | 3D-FRONT Bedroom (test) | FID | 114.9 | 10 |
| Text-to-scene generation | 3D-FRONT Livingroom (test) | FID | 111.5 | 10 |
| Completion | Indoor Scenes Living | iRecall | 44.49 | 4 |
| Completion | Indoor Scenes Dining | iRecall (%) | 0.5356 | 4 |
| Indoor Scene Stylization | Bedroom (test) | Delta (1e-3) | 7.03 | 4 |
| Re-arrangement | Indoor Scenes Living | iRecall | 58.16 | 4 |
| Unconditional Generation | Indoor Scenes Living | FID | 117.6 | 4 |
| Unconditional Generation | Indoor Scenes Dining | FID | 138.3 | 4 |
| 3D indoor scene synthesis from natural language | Bedroom | iRecall | 66.72 | 4 |