LangFlash: Feed-forward 3D Language Gaussian Splatting from Sparse Unposed Images

About

We present LangFlash, a feed-forward framework for 3D Language Gaussian Splatting that reconstructs 3D scenes parameterized by Gaussian primitives enriched with language-aligned semantic features from sparse, unposed multi-view images. Unlike optimization-based 3D methods, LangFlash directly predicts geometry and semantics in a single forward pass, enabling low-latency 3D reconstruction and language-consistent scene understanding. To support large-scale training, we enrich the RealEstate10k dataset with coherent and dense semantic information for 3D semantic supervision. Furthermore, we propose a sparse semantic encoding scheme that combines a global semantic dictionary with locally varying per-primitive weights, preserving high-level linguistic information while reducing representation complexity. Experimental results show that LangFlash achieves superior novel view synthesis and semantic consistency compared with previous methods. This study establishes a new paradigm for pose-free, language-grounded 3D scene reconstruction, advancing generalizable 3D vision and multimodal scene understanding. A demo is available at https://liylo.github.io/langflash.github.io/.
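
The sparse semantic encoding described above (a shared dictionary plus per-primitive weights) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the weights are sparsified or how features are decoded, so the top-k selection, the weighted-sum decoding, the function names, and the toy 4-dimensional dictionary below are all assumptions made for illustration.

```python
import math

def sparsify_top_k(weights, k):
    """Keep the k largest-magnitude weights as (atom_index, weight) pairs.

    Top-k selection is an assumption; the source only says the weights are sparse.
    """
    top = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)[:k]
    return [(i, weights[i]) for i in sorted(top)]

def decode_feature(sparse_weights, dictionary):
    """Rebuild one primitive's semantic feature as a weighted sum of dictionary atoms."""
    dim = len(dictionary[0])
    feature = [0.0] * dim
    for atom_index, weight in sparse_weights:
        for d in range(dim):
            feature[d] += weight * dictionary[atom_index][d]
    return feature

def cosine_similarity(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical global dictionary: 3 atoms in a toy 4-dimensional language space.
dictionary = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

# Hypothetical dense weights predicted for a single Gaussian primitive.
dense_weights = [0.7, 0.05, 0.3]

sparse = sparsify_top_k(dense_weights, k=2)    # keeps atoms 0 and 2
feature = decode_feature(sparse, dictionary)   # weighted sum of the kept atoms

# Compare against a hypothetical text-query embedding in the same space.
text_query = [1.0, 0.0, 0.0, 0.0]
score = cosine_similarity(feature, text_query)
```

Under these assumptions, each primitive stores only `k` index-weight pairs instead of a full high-dimensional feature, while the dictionary is stored once for the whole scene.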

Yilong Liu, Wanhua Li, Chen Zhu-Tian, Hanspeter Pfister • 2026

Related benchmarks

Task	Dataset	Result
3D Semantic Segmentation	3D-OVS	Bed 67.8	55
Novel View Synthesis	ScanNet Target View (40 unseen scenes)	PSNR 24.8	12
Open Vocabulary Semantic Segmentation	ScanNet Source View (40 unseen scenes)	mIoU 73.44	11
Open Vocabulary Semantic Segmentation	ScanNet Target View (40 unseen scenes)	mIoU 74.16	11
3D Semantic Segmentation	RE10k (unseen)	mIoU (Bedroom) 34.33	3

