SceneNAT: Masked Generative Modeling for Language-Guided Indoor Scene Synthesis
About
We present SceneNAT, a single-stage masked non-autoregressive Transformer that synthesizes complete 3D indoor scenes from natural language instructions through only a few parallel decoding passes, offering improved performance and efficiency compared to prior state-of-the-art approaches. SceneNAT is trained via masked modeling over fully discretized representations of both semantic and spatial attributes. By applying a masking strategy at both the attribute level and the instance level, the model can better capture intra-object and inter-object structure. To boost relational reasoning, SceneNAT employs a dedicated triplet predictor for modeling the scene's layout and object relationships by mapping a set of learnable relation queries to a sparse set of symbolic triplets (subject, predicate, object). Extensive experiments on the 3D-FRONT dataset demonstrate that SceneNAT achieves superior performance compared to state-of-the-art autoregressive and diffusion baselines in both semantic compliance and spatial arrangement accuracy, while operating with substantially lower computational cost.
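The two masking granularities described above can be illustrated with a minimal sketch. It is not the authors' implementation: the scene layout (a list of objects, each a list of discrete attribute tokens), the `MASK` id, the function name `mask_scene`, and the two ratio parameters are all assumptions made for illustration. The abstract does not specify how SceneNAT samples or schedules its masks.

```python
import random

MASK = -1  # hypothetical id for the [MASK] token (assumption, not from the paper)


def mask_scene(tokens, attr_ratio, inst_ratio, rng):
    """Return a masked copy of `tokens` and the boolean mask that was applied.

    tokens:     list of objects, each a list of discrete attribute tokens
                (e.g. class, position bins, size bins, orientation bin)
    attr_ratio: probability of masking each attribute token on its own
    inst_ratio: probability of masking every attribute of an object at once
    rng:        a random.Random instance, so results are reproducible
    """
    masked, mask = [], []
    for obj in tokens:
        if rng.random() < inst_ratio:
            # Instance-level masking: hide the whole object, so the model
            # must infer it from the other objects (inter-object structure).
            row = [True] * len(obj)
        else:
            # Attribute-level masking: hide individual attributes, so the
            # model must infer them from the object's remaining attributes
            # (intra-object structure).
            row = [rng.random() < attr_ratio for _ in obj]
        mask.append(row)
        masked.append([MASK if m else t for t, m in zip(obj, row)])
    return masked, mask


if __name__ == "__main__":
    # Toy scene: 3 objects x 4 discretized attributes (values are arbitrary).
    scene = [[5, 12, 30, 7], [2, 44, 18, 1], [9, 3, 27, 6]]
    masked, mask = mask_scene(scene, attr_ratio=0.3, inst_ratio=0.2,
                              rng=random.Random(0))
    for original, hidden in zip(scene, masked):
        print(original, "->", hidden)
```

During training, a model of this kind would be asked to predict the original tokens at the masked positions. At inference, starting from a fully masked scene and filling in positions over a few passes would give the parallel decoding behavior the abstract mentions.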
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-scene generation | 3D-FRONT Bedroom (test) | FID | 109.5 | 10 |
| Text-to-scene generation | 3D-FRONT Livingroom (test) | FID | 110.3 | 10 |
| Text-to-scene generation | 3D-FRONT Diningroom (test) | FID | 129.7 | 10 |
| 3D indoor scene synthesis from natural language | Bedroom | iRecall | 70.45 | 4 |
| 3D indoor scene synthesis from natural language | Living room | iRecall | 0.5001 | 4 |
| 3D indoor scene synthesis from natural language | Dining room | iRecall (%) | 56.29 | 4 |
| Completion | Indoor Scenes (Bed) | iRecall (%) | 67.97 | 4 |
| Re-arrangement | Indoor Scenes (Bed) | iRecall | 77.26 | 4 |
| Re-arrangement | Indoor Scenes Dining | iRecall | 62 | 4 |
| Completion | Indoor Scenes Living | iRecall | 40.61 | 4 |