4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models

About

Learning 4D language fields to enable time-sensitive, open-ended language queries in dynamic scenes is essential for many real-world applications. While LangSplat successfully grounds CLIP features into 3D Gaussian representations, achieving precision and efficiency in 3D static scenes, it lacks the ability to handle dynamic 4D fields as CLIP, designed for static image-text tasks, cannot capture temporal dynamics in videos. Real-world environments are inherently dynamic, with object semantics evolving over time. Building a precise 4D language field necessitates obtaining pixel-aligned, object-wise video features, which current vision models struggle to achieve. To address these challenges, we propose 4D LangSplat, which learns 4D language fields to handle time-agnostic or time-sensitive open-vocabulary queries in dynamic scenes efficiently. 4D LangSplat bypasses learning the language field from vision features and instead learns directly from text generated from object-wise video captions via Multimodal Large Language Models (MLLMs). Specifically, we propose a multimodal object-wise video prompting method, consisting of visual and text prompts that guide MLLMs to generate detailed, temporally consistent, high-quality captions for objects throughout a video. These captions are encoded using a Large Language Model into high-quality sentence embeddings, which then serve as pixel-aligned, object-specific feature supervision, facilitating open-vocabulary text queries through shared embedding spaces. Recognizing that objects in 4D scenes exhibit smooth transitions across states, we further propose a status deformable network to model these continuous changes over time effectively. Our results across multiple benchmarks demonstrate that 4D LangSplat attains precise and efficient results for both time-sensitive and time-agnostic open-vocabulary queries.

Wanhua Li, Renping Zhou, Jiawei Zhou, Yingwei Song, Johannes Herter, Minghan Qin, Gao Huang, Hanspeter Pfister • 2025
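
The abstract describes querying through a shared embedding space: per-pixel features supervised by caption embeddings are compared against the embedding of a text query. As a rough illustration only, the sketch below thresholds cosine similarity per pixel and per frame. The function names (`cosine`, `query_frames`), the threshold value, and the toy 2-D vectors are all hypothetical stand-ins; the abstract does not specify the actual matching rule, and real features would be high-dimensional embeddings rendered from the 4D Gaussians.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def query_frames(features, query, threshold=0.8):
    """Return, for each frame, the set of pixel indices matching the query.

    features: list over frames; each frame is a list of per-pixel embeddings.
    query: embedding of the text query, assumed to live in the same space.
    threshold: hypothetical similarity cutoff, not taken from the paper.
    """
    return [
        {i for i, feat in enumerate(frame) if cosine(feat, query) >= threshold}
        for frame in features
    ]


# Toy 2-D vectors standing in for sentence embeddings (hypothetical values).
STATE_A = [1.0, 0.0]      # e.g. caption embedding for an object's first state
STATE_B = [0.0, 1.0]      # e.g. caption embedding for its later state
BACKGROUND = [-1.0, -1.0]

# Three frames of three pixels each; pixel 1 drifts from STATE_A to STATE_B.
features = [
    [BACKGROUND, STATE_A, BACKGROUND],
    [BACKGROUND, [0.5, 0.5], BACKGROUND],
    [BACKGROUND, STATE_B, BACKGROUND],
]

# A time-sensitive query for the later state matches only the last frame.
masks = query_frames(features, STATE_B)
print(masks)
```

In this toy setup the query matches pixel 1 only in the final frame, which is the behavior a time-sensitive query is meant to capture: the same object is selected or rejected depending on its state at each timestep.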

Related benchmarks

Task	Dataset	Result
3D Semantic Segmentation	ScanNet (test)	mIoU 7.22	117
3D Semantic Mapping	Replica	mAcc 22.93	25
Novel View Reconstruction	HyperNeRF 4D LangSplat (test)	Americano Score 87	20
Novel View Reconstruction	HyperNeRF held-out 4D LangSplat (test)	Americano Score 28	20
Time-agnostic querying	HyperNeRF (test)	mIoU 82.95	10
Time-agnostic querying	Neu3D (test)	mIoU 85.19	10
Open-vocabulary 4D querying	HyperNeRF americano scene	Mean Accuracy 98.02	6
Time-sensitive language queries	HyperNeRF	Americano Acc 89.42	6
Open-vocabulary 4D querying	HyperNeRF espresso	mAcc 96.33	6
Video Classification	UCF101 v1 (test)	--	5

Showing 10 of 22 rows
