FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views

About

We present FLEG, a feed-forward network that reconstructs language-embedded 3D Gaussians from arbitrary views. Previous straightforward solutions combine feed-forward reconstruction with Gaussian heads, but they are restricted to fixed input views and limited by scarce 3D training data. In contrast, we propose a 3D-annotation-free training framework for 2D-to-3D lifting from arbitrary uncalibrated and unposed multi-view images. Since the framework does not require 3D annotations, we can leverage large-scale video data with easily obtained 2D instance information to enrich the semantic embedding. We also propose instance-guided contrastive learning to align 2D semantics with the 3D representations. In addition, to mitigate the high memory and computational cost of dense views, we propose a geometry-semantic hierarchical sparsification strategy. FLEG efficiently reconstructs a language-embedded 3D Gaussian representation in a feed-forward manner from arbitrary sparse or dense views, jointly producing accurate geometry, high-fidelity appearance, and language-aligned semantics. Extensive experiments show that it outperforms existing methods on various related tasks. Project page: https://fangzhou2000.github.io/projects/fleg.
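The abstract does not give the form of the instance-guided contrastive objective. As a rough illustration only, the sketch below uses a generic supervised-contrastive (InfoNCE-style) loss: features sharing a 2D instance label are pulled together, and features from different instances are pushed apart. The function names, the `temperature` value, and the assumption that features are sampled per pixel from a rendered feature map are all hypothetical and not taken from the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def instance_contrastive_loss(features, instance_ids, temperature=0.1):
    """Generic supervised-contrastive loss; NOT the paper's exact objective.

    features:     list of feature vectors (assumed: sampled rendered features)
    instance_ids: one 2D instance label per feature vector
    """
    feats = [normalize(f) for f in features]
    n = len(feats)
    total, count = 0.0, 0
    for i in range(n):
        # Temperature-scaled cosine similarity of anchor i to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(feats[i], feats[j])) / temperature
            for j in range(n)
            if j != i
        }
        positives = [j for j in sims if instance_ids[j] == instance_ids[i]]
        if not positives:
            continue  # an anchor with no same-instance partner contributes nothing
        # Numerically stable log-sum-exp over all non-anchor samples.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        count += 1
    return total / count if count else 0.0
```

With this formulation the loss is small when features of the same instance are already close and grows when the instance labels disagree with the feature similarities.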

Qijian Tian, Xin Tan, Jiayu Ying, Xuhong Wang, Yuan Xie, Lizhuang Ma · 2025

Related benchmarks

Task	Dataset	Result
Open Vocabulary Semantic Segmentation	ScanNet v2 (test)	mIoU 47.59	16
Novel View Synthesis	ScanNet v2 (test)	PSNR 24.2	12
Open Vocabulary Semantic Segmentation	ScanNet	mIoU 46.56	6
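For readers unfamiliar with the two metrics in the table, here is a minimal sketch of their standard definitions. The benchmark protocols themselves (label sets, ignored classes, image resolution) are not stated on this page, so the helpers `psnr` and `mean_iou` below are illustrative and may not match how the reported numbers were computed.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two flat lists of pixel values."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def mean_iou(pred_labels, true_labels, num_classes):
    """Mean intersection-over-union across classes, as a fraction in [0, 1]."""
    pairs = list(zip(pred_labels, true_labels))
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in pairs if p == c and t == c)
        union = sum(1 for p, t in pairs if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

`mean_iou` returns a fraction; the mIoU values in the table appear to be on a 0 to 100 scale, which would correspond to multiplying by 100.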
