Rethinking Image-to-3D Generation with Sparse Queries: Efficiency, Capacity, and Input-View Bias
About
We present SparseGen, a framework for efficient image-to-3D generation that exhibits low input-view bias while running significantly faster than dense-representation baselines. Unlike traditional approaches that rely on dense volumetric grids, triplanes, or pixel-aligned primitives, we model scenes with a compact sparse set of learned 3D anchor queries and a learned expansion operator that decodes each transformed query into a small local set of 3D Gaussian primitives. Trained under a rectified-flow reconstruction objective without 3D supervision, the model learns to allocate representation capacity where geometry and appearance matter, cutting memory and inference time while preserving multi-view fidelity. We introduce quantitative measures of input-view bias and utilization to show that sparse queries reduce overfitting to conditioning views while remaining representationally efficient. Our results argue that sparse set-latent expansion is a principled, practical alternative for efficient 3D generative modeling.
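To make the set-latent expansion concrete, below is a minimal sketch of how a bank of learned 3D anchor queries might be conditioned on image tokens and then expanded into per-query Gaussian parameters. Everything here is an illustrative assumption rather than the released SparseGen architecture: the class name, layer choices, and the 14-number Gaussian layout (offset, log-scale, quaternion, color, opacity) are all placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseQueryExpansion(nn.Module):
    """Sketch: N learned 3D anchor queries, each expanded into K local Gaussians."""

    def __init__(self, num_queries=512, dim=256, k=8):
        super().__init__()
        self.anchors = nn.Parameter(torch.randn(num_queries, 3) * 0.5)  # learned anchor positions
        self.queries = nn.Parameter(torch.randn(num_queries, dim))      # learned anchor query features
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)       # queries attend to image tokens
        # Expansion operator: one transformed query -> K Gaussians, each parameterized
        # by 3 offset + 3 log-scale + 4 quaternion + 3 color + 1 opacity = 14 numbers.
        self.expand = nn.Linear(dim, k * 14)
        self.k = k

    def forward(self, image_tokens):  # image_tokens: (B, T, dim) from any image encoder
        B = image_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        h = self.decoder(q, image_tokens)                        # (B, N, dim) transformed queries
        g = self.expand(h).view(B, -1, self.k, 14)               # (B, N, K, 14) raw Gaussian params
        off, log_s, rot, rgb, a = g.split([3, 3, 4, 3, 1], dim=-1)
        xyz = self.anchors[None, :, None, :] + 0.1 * off.tanh()  # Gaussians stay local to anchors
        return {
            "xyz": xyz.flatten(1, 2),                            # (B, N*K, 3)
            "scale": log_s.exp().flatten(1, 2),                  # positive scales
            "rot": F.normalize(rot, dim=-1).flatten(1, 2),       # unit quaternions
            "rgb": rgb.sigmoid().flatten(1, 2),
            "opacity": a.sigmoid().flatten(1, 2),
        }
```

Rendering these Gaussians with a differentiable splatting renderer and supervising the renders under the rectified-flow objective would close the training loop the abstract describes.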
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Novel View Synthesis | Google Scanned Objects (GSO) (test) | PSNR | 21.427 | 24 |
| Single-view 3D Reconstruction | SRN Cars (test) | PSNR | 24.018 | 7 |
| Single-view Reconstruction | CO3D Hydrant (held-out target view) | PSNR | 20.366 | 2 |
| Single-view Reconstruction | CO3D Teddybear (held-out target view) | PSNR | 19.005 | 2 |
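All entries above report PSNR on held-out renders. The abstract also introduces a quantitative input-view bias measure; its exact definition isn't given here, so the sketch below uses one plausible formulation as an assumption: the PSNR gap between the conditioning view and held-out target views. `render_fn`, `view.pose`, and `view.image` are hypothetical placeholders for a renderer and view records.

```python
import torch

def psnr(pred, target, max_val=1.0):
    """PSNR in dB between two images with values in [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(torch.tensor(max_val ** 2) / mse)

def input_view_bias(render_fn, input_view, novel_views):
    """Assumed bias score (not necessarily the paper's definition):
    PSNR on the conditioning view minus mean PSNR on held-out views.
    A large positive gap suggests overfitting to the input view."""
    psnr_in = psnr(render_fn(input_view.pose), input_view.image)
    psnr_novel = torch.stack(
        [psnr(render_fn(v.pose), v.image) for v in novel_views]
    ).mean()
    return psnr_in - psnr_novel
```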