SceneTok: A Compressed, Diffusable Token Space for 3D Scenes

About

We present SceneTok, a novel tokenizer for encoding view sets of scenes into a compressed and diffusable set of unstructured tokens. Existing approaches for 3D scene representation and generation commonly use 3D data structures or view-aligned fields. In contrast, we introduce the first method that encodes scene information into a small set of permutation-invariant tokens that is disentangled from the spatial grid. The scene tokens are predicted by a multi-view tokenizer given many context views and rendered into novel views by employing a lightweight rectified flow decoder. We show that the compression is 1 to 3 orders of magnitude stronger than for other representations while still reaching state-of-the-art reconstruction quality. Further, our representation can be rendered from novel trajectories, including ones deviating from the input trajectory, and we show that the decoder gracefully handles uncertainty. Finally, the highly compressed set of unstructured latent scene tokens enables simple and efficient scene generation in 5 seconds, achieving a much better quality-speed trade-off than previous paradigms.
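
To make two ideas from the abstract concrete, here is a minimal toy sketch, not the authors' implementation. It shows (a) Euler sampling of a rectified flow from noise at t=0 to a sample at t=1, and (b) conditioning that does not depend on token order. Every name in it (mean_pool, toy_velocity, sample_rectified_flow) is hypothetical, and the closed-form velocity stands in for a learned decoder network, which in practice would more likely attend over the tokens than average them. The t=0 noise, t=1 data convention is also an assumption, since papers differ on it.

```python
import random

def mean_pool(tokens):
    """Average equal-length token vectors; the result ignores token order."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]

def toy_velocity(x, t, tokens):
    """Hypothetical stand-in for a learned decoder network.

    For a single target point, the straight-line rectified-flow velocity
    at time t is (target - x) / (1 - t).
    """
    target = mean_pool(tokens)  # order-independent conditioning (assumption)
    return [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]

def sample_rectified_flow(velocity_fn, x0, tokens, num_steps=8):
    """Euler-integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (sample)."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1, so 1 - t stays positive
        v = velocity_fn(x, t, tokens)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(16)]  # 16 toy tokens
noise = [rng.gauss(0, 1) for _ in range(4)]                         # start at t=0

shuffled = tokens[:]
rng.shuffle(shuffled)  # same token set, different order

out = sample_rectified_flow(toy_velocity, noise, tokens)
out_shuffled = sample_rectified_flow(toy_velocity, noise, shuffled)
```

Because the toy paths are straight lines, the Euler steps land on the pooled target up to floating-point rounding, and shuffling the tokens leaves the result unchanged. A learned decoder would replace toy_velocity and would additionally take the target camera pose as input.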

Mohammad Asim, Christopher Wewer, Jan Eric Lenssen • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-view Generation | RealEstate10K | MEt3R | 0.0149 | 7 |
| Novel View Synthesis | DL3DV 140 (test) | PSNR | 21.95 | 6 |
| Novel View Synthesis | RE10K wide-view baseline (test) | PSNR | 23.99 | 5 |
| Novel View Synthesis | RE10K narrow-view baseline (test) | PSNR | 25.97 | 5 |
| Novel View Synthesis | ACID Zero-Shot v1 (test) | PSNR | 24.21 | 4 |
| Single-View Generation | 200 scenes 192 frames per scene (test) | PSNR | 15.12 | 4 |
| Transferability of novel camera trajectories | DL3DV-140 | Rotation Accuracy (10 deg) | 75.81 | 3 |
| Multi-view consistency | ACID | MEt3R | 0.0133 | 3 |
| Multi-view consistency | DL3DV | MEt3R Score | 0.0538 | 3 |
