DVSM: Decoder-only View Synthesis Model Done Right
About
Recent Large View Synthesis Models (LVSMs) advocate an encoder-decoder architecture that separates reconstruction and rendering into distinct networks. We re-examine this design. Through controlled experiments, we show that a decoder-only architecture, which represents scenes implicitly as a KV-cache, outperforms encoder-decoder variants while using fewer parameters at identical rendering complexity. Further analysis shows that sharing weights between the color-input reconstruction network and the camera-only rendering network better aligns their features at the same viewpoint, facilitating image synthesis. Building on this finding, our model, dubbed DVSM, further incorporates foundation model priors and stage-wise patch sizing for an improved efficiency-quality tradeoff. Our results establish a new state of the art for novel-view synthesis across multiple benchmarks, in some cases even outperforming per-scene-optimized 3DGS under dense input views.
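
To make the "scene as a KV-cache" idea concrete, here is a minimal sketch of a single attention layer whose weights are shared between the reconstruction pass (input-view tokens) and the rendering pass (target-camera tokens). It is an illustration under stated assumptions, not the DVSM implementation: the class name `SharedAttentionLayer`, the single-head design, the random weights, and the token shapes are all hypothetical, and target tokens here attend only to the cached scene, a simplification.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class SharedAttentionLayer:
    """Hypothetical single-head attention layer used by both passes.

    The same projection matrices serve the reconstruction pass and the
    rendering pass, mirroring the weight sharing described in the abstract.
    """

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        self.dim = dim
        self.wq = rng.normal(0.0, scale, (dim, dim))
        self.wk = rng.normal(0.0, scale, (dim, dim))
        self.wv = rng.normal(0.0, scale, (dim, dim))

    def build_cache(self, input_tokens):
        """Reconstruction pass: project input-view tokens to keys and values once."""
        return input_tokens @ self.wk, input_tokens @ self.wv

    def render(self, target_tokens, cache):
        """Rendering pass: target-camera tokens query the cached scene."""
        keys, values = cache
        queries = target_tokens @ self.wq
        weights = softmax(queries @ keys.T / np.sqrt(self.dim))
        return weights @ values


# Assumed shapes: 12 input-view tokens and 5 target-camera tokens of width 8.
layer = SharedAttentionLayer(dim=8)
rng = np.random.default_rng(1)
input_tokens = rng.normal(size=(12, 8))
target_tokens = rng.normal(size=(5, 8))

scene_cache = layer.build_cache(input_tokens)        # computed once per scene
rendered = layer.render(target_tokens, scene_cache)  # reused for every new view
```

In this sketch the cache is built once per scene and reused for any number of target views, so the cost of rendering a new view does not include re-encoding the inputs.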
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Novel View Synthesis | Mip-NeRF360 (test) | PSNR 24.76 | 80 |
| Novel View Synthesis | DL3DV | PSNR 29.71 | 75 |
| View Synthesis | Re10K (test) | PSNR 31.23 | 23 |
| View Synthesis | Free (test) | PSNR 25.57 | 6 |
| View Synthesis | Hike (test) | PSNR 23.1 | 6 |
| Novel View Synthesis | ScanNet++ iPhone Official Evaluation (held-out set) | PSNR 19.15 | 3 |
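
For readers unfamiliar with the metric in the table, PSNR (peak signal-to-noise ratio, in dB) is conventionally defined as 10·log10(MAX² / MSE). The sketch below computes it for two images, assuming pixel values scaled to [0, 1]; the function name `psnr` and the `max_value` parameter are illustrative, and the exact evaluation protocol behind the numbers above (resolution, cropping, averaging across views) is not stated in the source.

```python
import numpy as np


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    rendered = np.asarray(rendered, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    mse = np.mean((rendered - reference) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


# A uniform error of 0.1 on a [0, 1] image gives MSE = 0.01, i.e. about 20 dB.
reference = np.zeros((4, 4, 3))
rendered = np.full((4, 4, 3), 0.1)
score = psnr(rendered, reference)
```

Higher is better: each 10 dB gain corresponds to a tenfold reduction in mean squared error.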