SceneReVis: A Self-Reflective Vision-Grounded Framework for 3D Indoor Scene Synthesis via Multi-turn RL

About

Current one-pass 3D scene synthesis methods often suffer from spatial hallucinations, such as collisions, due to a lack of deliberative reasoning. To bridge this gap, we introduce SceneReVis, a vision-grounded self-reflection framework that employs an iterative "diagnose-and-act" loop to explicitly intercept and resolve spatial conflicts using multi-modal feedback. To support this step-wise paradigm, we construct SceneChain-12k, a large-scale dataset of causal construction trajectories derived through a novel reverse engineering pipeline. We further propose a two-stage training recipe that transitions from Supervised Fine-Tuning to Agentic Reinforcement Learning, evolving the model into an active spatial planner. Extensive experiments demonstrate that SceneReVis achieves state-of-the-art performance in high-fidelity generation and goal-oriented optimization, with robust generalization to long-tail domains.
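To make the "diagnose-and-act" idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a 2D floor plan of axis-aligned boxes, a purely geometric collision check in place of the multi-modal feedback described above, and a hand-written push-apart rule in place of a learned policy. All names (`Box`, `diagnose`, `act`, `refine`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical axis-aligned furniture footprint on the floor plane."""
    name: str
    x: float  # centre x
    y: float  # centre y
    w: float  # extent along x
    d: float  # extent along y


def overlap(a, b):
    """Return penetration depths (ox, oy), or None if the boxes are disjoint."""
    ox = (a.w + b.w) / 2 - abs(a.x - b.x)
    oy = (a.d + b.d) / 2 - abs(a.y - b.y)
    if ox > 0 and oy > 0:
        return ox, oy
    return None


def diagnose(scene):
    """Diagnose step: list every colliding pair with its penetration depths."""
    issues = []
    for i in range(len(scene)):
        for j in range(i + 1, len(scene)):
            depths = overlap(scene[i], scene[j])
            if depths is not None:
                issues.append((i, j, depths))
    return issues


def act(scene, issue, margin=0.01):
    """Act step: push the second box away along the axis of least penetration."""
    i, j, (ox, oy) = issue
    a, b = scene[i], scene[j]
    if ox <= oy:
        b.x += (1.0 if b.x >= a.x else -1.0) * (ox + margin)
    else:
        b.y += (1.0 if b.y >= a.y else -1.0) * (oy + margin)


def refine(scene, max_turns=10):
    """Alternate diagnose and act until no collision remains or turns run out.

    Returns the number of act steps taken; the scene may still contain
    collisions if max_turns is reached.
    """
    for turn in range(max_turns):
        issues = diagnose(scene)
        if not issues:
            return turn
        act(scene, issues[0])
    return max_turns


scene = [Box("bed", 0.0, 0.0, 2.0, 2.0), Box("nightstand", 0.9, 0.0, 0.5, 0.5)]
turns = refine(scene)
```

The turn budget keeps the loop bounded even when resolving one collision creates another. This toy version ignores room boundaries, object orientation, and rendered-image feedback.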

Yang Zhao, Shizhao Sun, Meisheng Zhang, Yingdong Shi, Xubo Yang, Jiang Bian • 2026

Related benchmarks

Task	Dataset	Metric	Result
3D Indoor Scene Synthesis	Bedroom (Standard Split)	CNR	4.6	13
Scene editing	AuthorBench	All-Goal Success	12	10
3D Indoor Scene Synthesis	Living Room (Standard Split)	OBR	1.2	7
3D Indoor Scene Synthesis	Avg. Bed + Living (Standard Split)	OBR	2	7
3D Indoor Scene Synthesis	User Study	Physical Plausibility	1.8	5
3D Indoor Scene Synthesis	Dining Room (Generalization Split)	OBR	0.1	5
3D Indoor Scene Synthesis	Study Room (Generalization Split)	Object Realism (OBR)	0.5	5
3D Indoor Scene Synthesis	Dining + Study Average (Generalization Split)	OBR (Object Realism)	0.3	5
3D Scene Editing Evaluation	Human Evaluation Study	Satisfaction Score	3.2	4
Goal-oriented Scene Optimization	SceneChain-12k Cond 1: Chaotic & Missing	OBR	1.1	3

