4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

About

Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.

Chiao-An Yang, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen • 2025

Related benchmarks

Task	Dataset	Result
3D/4D Video Question Answering	STI-Bench	Accuracy 37.6	12
3D/4D Video Question Answering	MMSI-Bench	Accuracy 33.3	12
3D/4D Video Question Answering	VLM4D real	Accuracy 52.7	11
4D Understanding	R4D-Bench (All)	Average Score 42.2	9
3D/4D Visual Question Answering	STI Bench 1.0 (test)	Accuracy 59.1	8
3D/4D Video Question Answering	SAT	Accuracy 64.7	8
3D/4D Video Question Answering	OmniSpatial	Accuracy 40.4	7
3D/4D Video Question Answering	VSTI-Bench	Accuracy 59.1	5
3D/4D Visual Question Answering	VLM4D-real 1.0 (test)	Accuracy (MCQ) 53.7	4
3D/4D Visual Question Answering	MMSI Bench 1.0 (test)	Avg Multiple Choice Accuracy 33.3	4

Showing 10 of 12 rows
