C3G: Learning Compact 3D Representations with 2K Gaussians

About

Reconstructing and understanding 3D scenes from unposed sparse views in a feed-forward manner remains a challenging task in 3D computer vision. Recent approaches use per-pixel 3D Gaussian Splatting for reconstruction, followed by a 2D-to-3D feature lifting stage for scene understanding. However, they generate many redundant Gaussians, which causes high memory overhead and sub-optimal multi-view feature aggregation and, in turn, degrades novel view synthesis and scene understanding performance. We propose C3G, a novel feed-forward framework that estimates compact 3D Gaussians only at essential spatial locations, minimizing redundancy while enabling effective feature lifting. We introduce learnable tokens that aggregate multi-view features through self-attention to guide Gaussian generation, ensuring each Gaussian integrates relevant visual features across views. We then exploit the learned attention patterns for Gaussian decoding to efficiently lift features. Extensive experiments on pose-free novel view synthesis, 3D open-vocabulary segmentation, and view-invariant feature aggregation demonstrate our approach's effectiveness. Results show that a compact yet geometrically meaningful representation is sufficient for high-quality scene reconstruction and understanding, achieving superior memory efficiency and feature fidelity compared to existing methods.
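The token-based aggregation and attention-driven feature lifting described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a single attention layer with one head and no learned query/key/value projections, and the row renormalization of the token-to-patch block is a choice made here for illustration. All names (`self_attention_weights`, `token_to_patch`, `gaussian_feats`) and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_weights(seq):
    # Single-head self-attention weights over one sequence.
    # Assumption: identity projections (no learned Q/K/V matrices).
    d = seq.shape[1]
    scores = seq @ seq.T / np.sqrt(d)
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
K, V, P, D, C = 8, 2, 16, 32, 4  # tokens, views, patches per view, width, 2D feature dim

tokens = rng.normal(size=(K, D))           # learnable tokens, one per Gaussian
patch_feats = rng.normal(size=(V * P, D))  # image features from all views
feats_2d = rng.normal(size=(V * P, C))     # 2D features to lift (e.g. semantic features)

# Self-attention over the concatenation [tokens; multi-view patches].
seq = np.concatenate([tokens, patch_feats], axis=0)
attn = self_attention_weights(seq)         # (K + V*P, K + V*P)

# Updated tokens aggregate information across views; a decoder (not shown)
# would map each one to Gaussian parameters.
updated_tokens = (attn @ seq)[:K]          # (K, D)

# Reuse the token-to-patch attention block to lift 2D features onto Gaussians.
token_to_patch = attn[:K, K:]
token_to_patch = token_to_patch / token_to_patch.sum(axis=1, keepdims=True)
gaussian_feats = token_to_patch @ feats_2d  # (K, C)
```

In this sketch each Gaussian's feature is a convex combination of 2D features from every view, which is one simple way to read "integrates relevant visual features across views"; the actual model may weight, project, or stack layers differently.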

Honggyu An, Jaewoo Jung, Mungyeom Kim, Sunghwan Hong, Chaehyun Kim, Kazumi Fukuda, Minkyeong Jeon, Jisang Han, Takuya Narihira, Hyuna Ko, Junsu Kim, Yuki Mitsufuji, Seungryong Kim • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
--- | --- | --- | --- | ---
Correspondence estimation | ScanNet 1.0 (test) | PCK@10px (0°-15°) | 94.2 | 13
Novel View Synthesis | RealEstate10K 80 (test) | PSNR | 22.387 | 10
Novel View Synthesis | RealEstate10K 12 view | PSNR | 28.552 | 6
Novel View Synthesis | RealEstate10K 24 view | PSNR | 29.987 | 6
Novel View Synthesis | RealEstate10K 36 view | PSNR | 30.25 | 6
Open-Vocabulary Segmentation | ScanNet Target View | LSeg mIoU | 51.3 | 5
Open-Vocabulary Segmentation | ScanNet Source View | LSeg mIoU | 54.2 | 5
3D scene understanding | Replica (Target View) | LSeg mIoU | 63 | 5
3D scene understanding | Replica (Source View) | LSeg mIoU | 64.9 | 5
3D Scene Reconstruction | ScanNet Target View | MaskCLIP PSNR | 23.75 | 4
Showing 10 of 11 rows
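Several of the rows above report PSNR. For reference, here is a minimal sketch of the standard PSNR definition, assuming images scaled to [0, 1] and a single image pair; the page does not state the value range, masking, or per-scene averaging used for the listed numbers, so this is not necessarily the benchmark's exact protocol. The function name `psnr` and the `max_val` parameter are illustrative.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    # Peak signal-to-noise ratio in dB between two images of equal shape.
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# A constant error of 0.1 on a [0, 1] image gives MSE = 0.01.
target = np.zeros((4, 4, 3))
pred = np.full((4, 4, 3), 0.1)
value = psnr(pred, target)
```

Because PSNR is logarithmic in the mean squared error, a gain of about 3 dB corresponds to roughly halving the MSE.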
