Rethinking Image-to-3D Generation with Sparse Queries: Efficiency, Capacity, and Input-View Bias
About
We present SparseGen, a framework for efficient image-to-3D generation that exhibits low input-view bias while running significantly faster than dense-representation baselines. Unlike traditional approaches that rely on dense volumetric grids, triplanes, or pixel-aligned primitives, we model scenes with a compact sparse set of learned 3D anchor queries and a learned expansion operator that decodes each transformed query into a small local set of 3D Gaussian primitives. Trained under a rectified-flow reconstruction objective without 3D supervision, the model learns to allocate representation capacity where geometry and appearance matter, cutting memory and inference time while preserving multi-view fidelity. We introduce quantitative measures of input-view bias and utilization to show that sparse queries reduce overfitting to conditioning views while remaining representationally efficient. Our results argue that sparse set-latent expansion is a principled, practical alternative for efficient 3D generative modeling.
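To make the set-latent expansion concrete, below is a minimal sketch of how a bank of learned 3D anchor queries might be conditioned on image tokens and then expanded into per-query Gaussian parameters. Everything here is an illustrative assumption rather than the released SparseGen architecture: the class name, layer choices, and the 14-number Gaussian layout (offset, log-scale, quaternion, color, opacity) are all placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseQueryExpansion(nn.Module):
    """Sketch: N learned 3D anchor queries, each expanded into K local Gaussians."""

    def __init__(self, num_queries=512, dim=256, k=8):
        super().__init__()
        self.anchors = nn.Parameter(torch.randn(num_queries, 3) * 0.5)  # learned anchor positions
        self.queries = nn.Parameter(torch.randn(num_queries, dim))      # learned anchor query features
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)       # queries attend to image tokens
        # Expansion operator: one transformed query -> K Gaussians, each parameterized
        # by 3 offset + 3 log-scale + 4 quaternion + 3 color + 1 opacity = 14 numbers.
        self.expand = nn.Linear(dim, k * 14)
        self.k = k

    def forward(self, image_tokens):  # image_tokens: (B, T, dim) from any image encoder
        B = image_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        h = self.decoder(q, image_tokens)                        # (B, N, dim) transformed queries
        g = self.expand(h).view(B, -1, self.k, 14)               # (B, N, K, 14) raw Gaussian params
        off, log_s, rot, rgb, a = g.split([3, 3, 4, 3, 1], dim=-1)
        xyz = self.anchors[None, :, None, :] + 0.1 * off.tanh()  # Gaussians stay local to anchors
        return {
            "xyz": xyz.flatten(1, 2),                            # (B, N*K, 3)
            "scale": log_s.exp().flatten(1, 2),                  # positive scales
            "rot": F.normalize(rot, dim=-1).flatten(1, 2),       # unit quaternions
            "rgb": rgb.sigmoid().flatten(1, 2),
            "opacity": a.sigmoid().flatten(1, 2),
        }
```

Rendering these Gaussians with a differentiable splatting renderer and supervising the renders under the rectified-flow objective would close the training loop the abstract describes.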
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Novel View Synthesis | Google Scanned Objects (GSO) (test) | PSNR | 21.427 | 24 |
| Single-view 3D Reconstruction | SRN Cars (test) | PSNR | 24.018 | 7 |
| Single-view Reconstruction | CO3D Hydrant (held-out target view) | PSNR | 20.366 | 2 |
| Single-view Reconstruction | CO3D Teddybear (held-out target view) | PSNR | 19.005 | 2 |
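All entries above report PSNR on held-out renders. The abstract also introduces a quantitative input-view bias measure; its exact definition isn't given here, so the sketch below uses one plausible formulation as an assumption: the PSNR gap between the conditioning view and held-out target views. `render_fn`, `view.pose`, and `view.image` are hypothetical placeholders for a renderer and view records.

```python
import torch

def psnr(pred, target, max_val=1.0):
    """PSNR in dB between two images with values in [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(torch.tensor(max_val ** 2) / mse)

def input_view_bias(render_fn, input_view, novel_views):
    """Assumed bias score (not necessarily the paper's definition):
    PSNR on the conditioning view minus mean PSNR on held-out views.
    A large positive gap suggests overfitting to the input view."""
    psnr_in = psnr(render_fn(input_view.pose), input_view.image)
    psnr_novel = torch.stack(
        [psnr(render_fn(v.pose), v.image) for v in novel_views]
    ).mean()
    return psnr_in - psnr_novel
```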