Camera-Aware Cross-View Alignment for Referring 3D Gaussian Splatting Segmentation

About

Referring 3D Gaussian Splatting Segmentation (R3DGS) aims to ground free-form language queries in 3D Gaussian fields. However, existing methods rely on single-view pseudo supervision, leading to viewpoint drift and inconsistent predictions across views. We propose CaRF (Camera-aware Referring Field), a camera-aware cross-view alignment framework for view-consistent referring in 3D Gaussian splatting. CaRF introduces Camera-conditioned Alignment Modulation (CAM) to inject camera geometry into Gaussian-text interactions, and Gaussian-level Cross-view Logit Alignment (GCLA) to explicitly align referring responses of the same Gaussians across calibrated views during training. By turning cross-view discrepancy into an optimizable objective, CaRF enables geometry-aware and view-consistent reasoning directly in the Gaussian space. Extensive experiments on three benchmarks demonstrate that CaRF achieves state-of-the-art performance, improving mIoU by 16.8%, 4.3%, and 2.0% on Ref-LERF, LERF-OVS, and 3D-OVS, respectively. Our code is available at https://github.com/eR3R3/CaRF.
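To illustrate what aligning "referring responses of the same Gaussians across calibrated views" could look like as an optimizable objective, here is a minimal sketch. It is not the authors' implementation: the abstract does not give the exact form of the GCLA loss. The function name `cross_view_logit_alignment`, the squared-difference penalty, and the per-view visibility masks are all assumptions made for illustration.

```python
def cross_view_logit_alignment(logits_a, logits_b, visible_a, visible_b):
    """Hypothetical cross-view consistency penalty (assumed form, not from the paper).

    logits_a, logits_b: per-Gaussian referring logits for the same query,
        computed under two different camera views (same Gaussian ordering).
    visible_a, visible_b: per-Gaussian booleans marking whether each
        Gaussian is visible in the corresponding view.
    Returns the mean squared logit gap over Gaussians visible in both views.
    """
    gaps = [
        (la - lb) ** 2
        for la, lb, va, vb in zip(logits_a, logits_b, visible_a, visible_b)
        if va and vb  # only Gaussians seen from both cameras contribute
    ]
    # With no shared Gaussians there is nothing to align.
    return sum(gaps) / len(gaps) if gaps else 0.0


# Toy example: three Gaussians, the third is hidden in view A.
loss = cross_view_logit_alignment(
    [1.0, 2.0, 3.0], [1.0, 0.0, 5.0],
    [True, True, False], [True, True, True],
)
```

In this toy case only the first two Gaussians are shared, so the penalty averages their squared logit gaps. A training setup would presumably add a term like this to the segmentation loss so that view-dependent disagreement is penalized directly on the Gaussians.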

Yuwen Tao, Kanglei Zhou, Xin Tan, Yuan Xie • 2025

Related benchmarks

Task	Dataset	Metric	Result
3D Open-vocabulary Segmentation	LERF-OVS	mIoU (Ramen)	55.2	24
3D Language Grounding	Ref-LERF	Ramen Score	33.5	14
3D Open-vocabulary Segmentation	3D-OVS	mIoU (Bed)	92.1	14

