LUDVIG: Learning-Free Uplifting of 2D Visual Features to Gaussian Splatting Scenes

About

We address the problem of extending the capabilities of vision foundation models such as DINO, SAM, and CLIP to 3D tasks. Specifically, we introduce a novel method to uplift 2D image features into Gaussian Splatting representations of 3D scenes. Unlike traditional approaches that rely on minimizing a reconstruction loss, our method employs a simpler and more efficient feature aggregation technique, augmented by a graph diffusion mechanism. Graph diffusion refines 3D features, such as coarse segmentation masks, by leveraging 3D geometry and pairwise similarities induced by DINOv2. Our approach achieves performance comparable to the state of the art on multiple downstream tasks while delivering significant speed-ups. Notably, we obtain competitive segmentation results using only generic DINOv2 features, even though DINOv2, unlike SAM, is not trained on millions of annotated segmentation masks. When applied to CLIP features, our method demonstrates strong performance in open-vocabulary object segmentation tasks, highlighting the versatility of our approach.
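The two ingredients named in the abstract, learning-free feature aggregation and graph diffusion, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each Gaussian receives a weighted average of the 2D features of the pixels it contributes to, and that diffusion runs on a k-nearest-neighbor graph in 3D whose edges are weighted by feature similarity. The function names, the `(gaussian_id, pixel_id, weight)` contribution format, and the parameters `k`, `steps`, and `alpha` are all hypothetical choices made for illustration.

```python
from math import sqrt


def uplift_features(contribs, pixel_features, num_gaussians):
    """Aggregate 2D pixel features into one feature per Gaussian.

    contribs: iterable of (gaussian_id, pixel_id, weight) triples, where
        weight is an assumed per-pixel rendering weight of that Gaussian.
    pixel_features: dict mapping pixel_id to a feature vector (list of floats).
    Returns a list of weighted-average features, one per Gaussian.
    """
    dim = len(next(iter(pixel_features.values())))
    sums = [[0.0] * dim for _ in range(num_gaussians)]
    totals = [0.0] * num_gaussians
    for g, p, w in contribs:
        feat = pixel_features[p]
        for d in range(dim):
            sums[g][d] += w * feat[d]
        totals[g] += w
    # Gaussians that no pixel contributes to keep a zero feature.
    return [[s / t if t > 0 else 0.0 for s in row]
            for row, t in zip(sums, totals)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def build_graph(positions, features, k=2):
    """Connect each node to its k nearest neighbors in 3D.

    Edge weights are the cosine similarity of the node features, clipped
    at zero so that dissimilar neighbors do not exchange information.
    """
    n = len(positions)
    neighbors = []
    for i in range(n):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(positions[i], positions[j])), j)
            for j in range(n) if j != i
        )
        neighbors.append(
            [(j, max(cosine(features[i], features[j]), 0.0))
             for _, j in dists[:k]]
        )
    return neighbors


def diffuse(values, neighbors, steps=10, alpha=0.5):
    """Smooth per-node scalar values (e.g. a coarse mask) over the graph.

    At each step a node's value is mixed with the similarity-weighted
    mean of its neighbors' values; alpha controls the mixing strength.
    """
    v = list(values)
    for _ in range(steps):
        new = []
        for i, nbrs in enumerate(neighbors):
            total = sum(w for _, w in nbrs)
            if total > 0:
                mean = sum(w * v[j] for j, w in nbrs) / total
                new.append((1 - alpha) * v[i] + alpha * mean)
            else:
                new.append(v[i])
        v = new
    return v
```

In this sketch, a coarse mask value placed on one Gaussian spreads to spatially close Gaussians with similar features, while neighbors with orthogonal features receive nothing because their edge weight is zero.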

Juliette Marrie, Romain Menegaux, Michael Arbel, Diane Larlus, Julien Mairal • 2024

Related benchmarks

Task	Dataset	Result
3D object selection	LERF-OVS	mIoU (Mean) 39.28	21
3D Semantic Segmentation	ScanNet	mIoU (10 classes) 40.47	17
Open-Vocabulary 3D Semantic Segmentation	ScanNet 19 classes	mIoU 33.9	17
Open-Vocabulary 3D Semantic Segmentation	ScanNet 10 classes	mIoU 46.4	17
Open-Vocabulary 3D Semantic Segmentation	ScanNet 15 classes	mIoU 37.4	17
3D Semantic Segmentation	ScanNet 10 classes	mIoU 41.11	17
3D Semantic Segmentation	ScanNet 15 classes	mIoU 33.73	17
Open-vocabulary 3D object selection	LERF	Ramen Score 42.3	16
3D object selection	LERF figurines scene	Peak VRAM 22	14
3D Semantic Segmentation	ScanNet 200 70 classes	mIoU 21.23	10
