SceneNAT: Masked Generative Modeling for Language-Guided Indoor Scene Synthesis

About

We present SceneNAT, a single-stage masked non-autoregressive Transformer that synthesizes complete 3D indoor scenes from natural language instructions through only a few parallel decoding passes, offering improved performance and efficiency compared to prior state-of-the-art approaches. SceneNAT is trained via masked modeling over fully discretized representations of both semantic and spatial attributes. By applying a masking strategy at both the attribute level and the instance level, the model can better capture intra-object and inter-object structure. To boost relational reasoning, SceneNAT employs a dedicated triplet predictor for modeling the scene's layout and object relationships by mapping a set of learnable relation queries to a sparse set of symbolic triplets (subject, predicate, object). Extensive experiments on the 3D-FRONT dataset demonstrate that SceneNAT achieves superior performance compared to state-of-the-art autoregressive and diffusion baselines in both semantic compliance and spatial arrangement accuracy, while operating with substantially lower computational cost.
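
The abstract does not give the token layout, masking ratios, or decoding schedule, so the following is only a minimal sketch of the two ideas it names: masking at both the attribute and instance level, and filling masked tokens over a few parallel passes. The names `MASK`, `mask_scene`, `parallel_decode`, and `predict` are hypothetical. The confidence-ranked cosine schedule is borrowed from generic masked generative decoders and is an assumption, not something the source confirms for SceneNAT.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a masked token


def mask_scene(scene, attr_ratio, inst_ratio, rng):
    """Return a masked copy of `scene` (a list of objects, each a list of
    discrete attribute tokens). Ratios are illustrative assumptions."""
    masked = [list(obj) for obj in scene]
    for obj in masked:
        if rng.random() < inst_ratio:
            # Instance-level masking: hide every attribute of this object.
            for i in range(len(obj)):
                obj[i] = MASK
        else:
            # Attribute-level masking: hide attributes independently.
            for i in range(len(obj)):
                if rng.random() < attr_ratio:
                    obj[i] = MASK
    return masked


def parallel_decode(scene, predict, num_passes):
    """Fill all MASK slots in at most `num_passes` passes.

    `predict(scene)` stands in for the model: it must return a dict mapping
    each masked (object_index, attribute_index) to (token, confidence).
    """
    scene = [list(obj) for obj in scene]
    total = sum(t == MASK for obj in scene for t in obj)
    for step in range(num_passes):
        slots = [(o, a) for o, obj in enumerate(scene)
                 for a, t in enumerate(obj) if t == MASK]
        if not slots:
            break
        preds = predict(scene)  # all masked slots are predicted in parallel
        # Assumed cosine schedule: how many slots remain masked after this pass.
        keep_masked = int(total * math.cos(math.pi / 2 * (step + 1) / num_passes))
        n_commit = max(1, len(slots) - keep_masked)
        # Commit only the most confident predictions; the rest stay masked.
        ranked = sorted(slots, key=lambda s: preds[s][1], reverse=True)
        for o, a in ranked[:n_commit]:
            scene[o][a] = preds[(o, a)][0]
    return scene
```

In this sketch, a training step would call `mask_scene` on a ground-truth token grid and ask the model to recover the hidden entries, while inference would start from a fully masked grid and call `parallel_decode` with the trained model wrapped as `predict`.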

Jeongjun Choi, Yeonsoo Park, H. Jin Kim • 2026

Related benchmarks

Task	Dataset	Result
Text-to-scene generation	3D-FRONT Bedroom (test)	FID 109.5	10
Text-to-scene generation	3D-FRONT Livingroom (test)	FID 110.3	10
Text-to-scene generation	3D-FRONT Diningroom (test)	FID 129.7	10
3D indoor scene synthesis from natural language	Bedroom	iRecall 70.45	4
3D indoor scene synthesis from natural language	Living room	iRecall 0.5001	4
3D indoor scene synthesis from natural language	Dining room	iRecall (%) 56.29	4
Completion	Indoor Scenes (Bed)	iRecall (%) 67.97	4
Re-arrangement	Indoor Scenes (Bed)	iRecall 77.26	4
Re-arrangement	Indoor Scenes Dining	iRecall 62	4
Completion	Indoor Scenes Living	iRecall 40.61	4

Showing 10 of 24 rows
