SceneNAT: Masked Generative Modeling for Language-Guided Indoor Scene Synthesis

About

We present SceneNAT, a single-stage masked non-autoregressive Transformer that synthesizes complete 3D indoor scenes from natural language instructions through only a few parallel decoding passes, offering improved performance and efficiency compared to prior state-of-the-art approaches. SceneNAT is trained via masked modeling over fully discretized representations of both semantic and spatial attributes. By applying a masking strategy at both the attribute level and the instance level, the model can better capture intra-object and inter-object structure. To boost relational reasoning, SceneNAT employs a dedicated triplet predictor for modeling the scene's layout and object relationships by mapping a set of learnable relation queries to a sparse set of symbolic triplets (subject, predicate, object). Extensive experiments on the 3D-FRONT dataset demonstrate that SceneNAT achieves superior performance compared to state-of-the-art autoregressive and diffusion baselines in both semantic compliance and spatial arrangement accuracy, while operating with substantially lower computational cost.
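The two masking levels described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a scene is stored as an integer array of shape (objects, attributes) holding discretized tokens, and the names `mask_scene`, `MASK_ID`, `attr_ratio`, and `inst_ratio` are hypothetical. Attribute-level masking hides individual cells, while instance-level masking hides every attribute of a selected object.

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel standing in for a [MASK] token


def mask_scene(tokens, attr_ratio, inst_ratio, rng):
    """Apply attribute-level and instance-level masking to a scene.

    tokens: integer array of shape (num_objects, num_attributes), where each
    entry is a discretized attribute token (an assumed layout, not taken
    from the paper).
    Returns the masked array and the boolean mask that was applied.
    """
    tokens = np.asarray(tokens)
    num_objects, num_attrs = tokens.shape

    # Attribute level: each (object, attribute) cell is hidden independently.
    mask = rng.random((num_objects, num_attrs)) < attr_ratio

    # Instance level: a selected object has all of its attributes hidden.
    instance_mask = rng.random(num_objects) < inst_ratio
    mask |= instance_mask[:, None]

    masked = np.where(mask, MASK_ID, tokens)
    return masked, mask


# Example: 4 objects, 3 discretized attributes each (illustrative values).
rng = np.random.default_rng(0)
scene = np.arange(12).reshape(4, 3)
masked_scene, applied_mask = mask_scene(scene, attr_ratio=0.3, inst_ratio=0.25, rng=rng)
```

A model trained this way would be asked to predict the original tokens at the masked positions; the ratios here are placeholders, and the paper's actual schedule and token vocabulary may differ.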

Jeongjun Choi, Yeonsoo Park, H. Jin Kim • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Text-to-scene generation | 3D-FRONT Bedroom (test) | FID | 109.5 | 10
Text-to-scene generation | 3D-FRONT Livingroom (test) | FID | 110.3 | 10
Text-to-scene generation | 3D-FRONT Diningroom (test) | FID | 129.7 | 10
3D indoor scene synthesis from natural language | Bedroom | iRecall | 70.45 | 4
3D indoor scene synthesis from natural language | Living room | iRecall | 0.5001 | 4
3D indoor scene synthesis from natural language | Dining room | iRecall (%) | 56.29 | 4
Completion | Indoor Scenes (Bed) | iRecall (%) | 67.97 | 4
Re-arrangement | Indoor Scenes (Bed) | iRecall | 77.26 | 4
Re-arrangement | Indoor Scenes Dining | iRecall | 62 | 4
Completion | Indoor Scenes Living | iRecall | 40.61 | 4
