DVSM: Decoder-only View Synthesis Model Done Right

About

Recent Large View Synthesis Models (LVSMs) advocate an encoder-decoder architecture that separates reconstruction and rendering into distinct networks. We re-examine this design. Through controlled experiments, we show that a decoder-only architecture, which represents scenes implicitly as a KV-cache, outperforms encoder-decoder variants while using fewer parameters at identical rendering complexity. Further analysis shows that sharing weights between the color-input reconstruction network and the camera-only rendering network better aligns their features at the same viewpoint, facilitating image synthesis. Building on this finding, our model, dubbed DVSM, further incorporates foundation model priors and stage-wise patch sizing for an improved efficiency-quality tradeoff. Our results establish a new state of the art for novel-view synthesis across multiple benchmarks, in some cases even outperforming per-scene-optimized 3DGS under dense input views.
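The idea of representing a scene as a KV-cache and rendering with the same weights can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a single attention layer with one head and random weights, and the names `TinySharedLayer`, `build_kv_cache`, and `render` are hypothetical. It also assumes that target tokens attend to themselves as well as to the cache, which the abstract does not specify.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class TinySharedLayer:
    """One single-head attention layer (hypothetical) used by both passes.

    The same projection matrices serve the reconstruction pass (input-view
    tokens) and the rendering pass (target-view tokens), which mirrors the
    weight sharing described in the abstract.
    """

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        self.dim = dim
        self.wq = rng.normal(0.0, scale, (dim, dim))
        self.wk = rng.normal(0.0, scale, (dim, dim))
        self.wv = rng.normal(0.0, scale, (dim, dim))

    def build_kv_cache(self, input_tokens):
        """Reconstruction pass: project input-view tokens to keys and values."""
        return input_tokens @ self.wk, input_tokens @ self.wv

    def render(self, target_tokens, kv_cache):
        """Rendering pass: target tokens attend to the cache and to themselves."""
        k_cache, v_cache = kv_cache
        q = target_tokens @ self.wq
        # The cache is only read here; it is never modified.
        k = np.concatenate([k_cache, target_tokens @ self.wk], axis=0)
        v = np.concatenate([v_cache, target_tokens @ self.wv], axis=0)
        attn = softmax(q @ k.T / np.sqrt(self.dim), axis=-1)
        return attn @ v


# Toy usage: build the cache once per scene, then reuse it for several views.
rng = np.random.default_rng(1)
layer = TinySharedLayer(dim=16)
input_tokens = rng.normal(size=(8, 16))       # stand-in for color + camera tokens
kv_cache = layer.build_kv_cache(input_tokens)
for _ in range(3):
    target_tokens = rng.normal(size=(4, 16))  # stand-in for camera-only tokens
    features = layer.render(target_tokens, kv_cache)
```

In this toy setup the per-view rendering cost depends only on the cache size and the number of target tokens, since the input views are processed once.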

Cheng Sun, Jaesung Choe, Min-Hung Chen, Ryo Hachiuma, Yu-Chiang Frank Wang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Novel View Synthesis | Mip-NeRF360 (test) | PSNR 24.76 | 80 |
| Novel View Synthesis | DL3DV | PSNR 29.71 | 75 |
| View Synthesis | Re10K (test) | PSNR 31.23 | 23 |
| View Synthesis | Free (test) | PSNR 25.57 | 6 |
| View Synthesis | Hike (test) | PSNR 23.1 | 6 |
| Novel View Synthesis | ScanNet++ iPhone Official Evaluation (held-out set) | PSNR 19.15 | 3 |
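All results above are reported as PSNR (peak signal-to-noise ratio, in dB; higher is better). As a reminder of what the metric measures, here is a minimal sketch of the standard definition. It assumes images stored as floats in [0, 1]; the exact evaluation protocol behind the listed numbers (resolution, cropping, averaging over views) is not stated here, and the helper name `psnr` is illustrative.

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """PSNR in dB: 10 * log10(max_val^2 / MSE)."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))


# A uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01, i.e. about 20 dB.
example = psnr(np.zeros((4, 4, 3)), np.full((4, 4, 3), 0.1))
```

Because the scale is logarithmic, a gain of 1 dB corresponds to roughly a 21% reduction in mean squared error.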
