
OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder

About

Existing diffusion-based 3D scene generation methods primarily operate in 2D image/video latent spaces, which makes maintaining cross-view appearance and geometric consistency inherently challenging. To bridge this gap, we present OneWorld, a framework that performs diffusion directly within a coherent 3D representation space. Central to our approach is the 3D Unified Representation Autoencoder (3D-URAE); it leverages pretrained 3D foundation models and augments their geometry-centric nature by injecting appearance and distilling semantics into a unified 3D latent space. Furthermore, we introduce a token-level Cross-View-Correspondence (CVC) consistency loss to explicitly enforce structural alignment across views, and propose Manifold-Drift Forcing (MDF), which mitigates train-inference exposure bias and shapes a robust 3D manifold by mixing drifted and original representations. Comprehensive experiments demonstrate that OneWorld generates high-quality 3D scenes with superior cross-view consistency compared to state-of-the-art 2D-based methods. Our code will be available at https://github.com/SensenGao/OneWorld.
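The two training-time ideas above can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the paper's implementation: the function names (`cvc_loss`, `mdf_mix`), the choice of cosine distance for the CVC loss, the precomputed correspondence index pairs, and the Bernoulli mixing probability in MDF are all hypothetical.

```python
import numpy as np

def cvc_loss(tokens_a, tokens_b, corr_ab):
    """Sketch of a token-level Cross-View-Correspondence (CVC) loss.

    tokens_a: (N, D) latent tokens from view A
    tokens_b: (M, D) latent tokens from view B
    corr_ab:  (K, 2) index pairs (i, j): token i in view A corresponds
              to token j in view B (assumed precomputed from geometry,
              e.g. by reprojection).
    Penalizes 1 - cosine similarity between corresponding tokens,
    pushing matched tokens toward structural agreement across views.
    """
    a = tokens_a[corr_ab[:, 0]]
    b = tokens_b[corr_ab[:, 1]]
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(a * b, axis=-1)))

def mdf_mix(latents, drifted, mix_prob=0.5, rng=None):
    """Sketch of Manifold-Drift Forcing (MDF) style mixing.

    Randomly swaps clean latents for 'drifted' ones (emulating the
    distribution shift the model sees at inference), so training is
    exposed to both and the learned 3D manifold stays robust.
    """
    rng = rng or np.random.default_rng(0)
    mask = rng.random(len(latents)) < mix_prob  # which rows get drifted
    return np.where(mask[:, None], drifted, latents)
```

In this reading, the CVC loss term is zero exactly when corresponding tokens align up to scale, and MDF reduces exposure bias by making drifted representations part of the training distribution rather than an inference-only surprise.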

Sensen Gao, Zhaoqing Wang, Qihang Cao, Dongdong Yu, Changhu Wang, Tongliang Liu, Mingming Gong, Jiawang Bian • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| 1-view-based novel view generation | RealEstate10K | PSNR | 21.57 | 7 |
| 1-view-based novel view generation | DL3DV-10K | PSNR | 17.19 | 7 |
| Single-image world generation | WorldScore Indoor | 3D Consistency | 84.98 | 7 |
| Single-image world generation | DL3DV | 3D Consistency | 78.21 | 7 |
| 3D Gaussian Splatting | RealEstate10K | PSNR | 28.19 | 5 |
| 3D Gaussian Splatting | DL3DV | PSNR | 24.68 | 5 |
