SceneReVis: A Self-Reflective Vision-Grounded Framework for 3D Indoor Scene Synthesis via Multi-turn RL
About
Current one-pass 3D scene synthesis methods often suffer from spatial hallucinations, such as collisions, due to a lack of deliberative reasoning. To bridge this gap, we introduce SceneReVis, a vision-grounded self-reflection framework that employs an iterative "diagnose-and-act" loop to explicitly intercept and resolve spatial conflicts using multi-modal feedback. To support this step-wise paradigm, we construct SceneChain-12k, a large-scale dataset of causal construction trajectories derived through a novel reverse-engineering pipeline. We further propose a two-stage training recipe that transitions from Supervised Fine-Tuning to Agentic Reinforcement Learning, evolving the model into an active spatial planner. Extensive experiments demonstrate that SceneReVis achieves state-of-the-art performance in high-fidelity generation and goal-oriented optimization, with robust generalization to long-tail domains.
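The iterative "diagnose-and-act" loop can be illustrated with a toy sketch. This is not the authors' implementation: the abstract describes a learned model reasoning over multi-modal feedback, whereas the sketch below substitutes a purely geometric overlap check on 2D axis-aligned footprints. All names (`Box`, `diagnose`, `act`, `refine`) and the `margin` and `max_turns` parameters are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Box:
    """Hypothetical 2D footprint of a furniture item (centre plus size)."""
    name: str
    x: float  # centre x
    y: float  # centre y
    w: float  # width along x
    d: float  # depth along y


def overlap(a: Box, b: Box) -> tuple[float, float]:
    """Penetration depth along x and y; both positive means a collision."""
    ox = (a.w + b.w) / 2 - abs(a.x - b.x)
    oy = (a.d + b.d) / 2 - abs(a.y - b.y)
    return ox, oy


def diagnose(scene: list[Box]) -> list[tuple[Box, Box, float, float]]:
    """Diagnose step: list every colliding pair with its penetration depths."""
    conflicts = []
    for a, b in combinations(scene, 2):
        ox, oy = overlap(a, b)
        if ox > 0 and oy > 0:
            conflicts.append((a, b, ox, oy))
    return conflicts


def act(conflict: tuple[Box, Box, float, float], margin: float = 0.01) -> None:
    """Act step: push the second object out along the axis of least penetration."""
    a, b, ox, oy = conflict
    if ox <= oy:
        b.x += (1.0 if b.x >= a.x else -1.0) * (ox + margin)
    else:
        b.y += (1.0 if b.y >= a.y else -1.0) * (oy + margin)


def refine(scene: list[Box], max_turns: int = 10) -> int:
    """Alternate diagnose and act; return the number of corrective actions applied."""
    for turn in range(max_turns):
        conflicts = diagnose(scene)
        if not conflicts:
            return turn
        act(conflicts[0])  # resolve one conflict per turn, then re-diagnose
    return max_turns
```

Resolving one conflict per turn and then re-diagnosing mirrors the step-wise idea in the abstract: each action changes the scene, so earlier diagnoses may no longer hold. A real system would also need room-boundary checks and a stopping rule for cases where no collision-free layout exists, both of which this sketch omits.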
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Indoor Scene Synthesis | Bedroom (Standard Split) | CNR | 4.6 | 13 |
| 3D Indoor Scene Synthesis | Living Room (Standard Split) | OBR | 1.2 | 7 |
| 3D Indoor Scene Synthesis | Avg. Bed + Living (Standard Split) | OBR | 2 | 7 |
| 3D Indoor Scene Synthesis | User Study | Physical Plausibility | 1.8 | 5 |
| 3D Indoor Scene Synthesis | Dining Room (Generalization Split) | OBR | 0.1 | 5 |
| 3D Indoor Scene Synthesis | Study Room (Generalization Split) | Object Realism (OBR) | 0.5 | 5 |
| 3D Indoor Scene Synthesis | Dining + Study Average (Generalization Split) | Object Realism (OBR) | 0.3 | 5 |
| Goal-oriented Scene Optimization | SceneChain-12k Cond 1: Chaotic & Missing | OBR | 1.1 | 3 |
| Goal-oriented Scene Optimization | SceneChain-12k Cond 2: Chaotic Only | OBR | 2.7 | 3 |
| Goal-oriented Scene Optimization | SceneChain-12k Cond 3: Missing Only | OBR | 0.011 | 3 |