
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

About

Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.
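The abstract describes P4D only at a high level: representations are transferred from a frozen expert model into 4D-RGPT. As a rough illustration of that general idea, the sketch below shows generic feature-matching distillation against a frozen teacher. It is not the paper's actual loss or architecture. The per-element scaling "student", the mean-squared-error objective, the learning rate, and every name in the code are assumptions made for this toy example.

```python
# Toy sketch of feature distillation with a frozen teacher.
# Assumption: the student is a per-element scale and the loss is plain MSE.
# This illustrates the general technique only, not the P4D method itself.

def student_features(weights, inputs):
    """Hypothetical student: scales each input element by a learnable weight."""
    return [w * x for w, x in zip(weights, inputs)]

def distillation_loss(student_feats, teacher_feats):
    """Mean squared error between student features and frozen teacher features."""
    n = len(teacher_feats)
    return sum((s - t) ** 2 for s, t in zip(student_feats, teacher_feats)) / n

def distill_step(weights, inputs, teacher_feats, lr=0.1):
    """One gradient-descent step on the student weights; the teacher is never updated."""
    n = len(teacher_feats)
    feats = student_features(weights, inputs)
    grads = [2.0 * (s - t) * x / n for s, t, x in zip(feats, teacher_feats, inputs)]
    return [w - lr * g for w, g in zip(weights, grads)]

inputs = [1.0, 2.0, -1.0]
teacher_feats = [0.5, -1.0, 2.0]  # stand-in for frozen expert outputs
weights = [0.0, 0.0, 0.0]

loss_before = distillation_loss(student_features(weights, inputs), teacher_feats)
for _ in range(200):
    weights = distill_step(weights, inputs, teacher_feats)
loss_after = distillation_loss(student_features(weights, inputs), teacher_feats)

print(loss_before, loss_after)
```

Only the student weights change during the loop. The teacher features act as fixed targets, which is what "frozen" means in this setting.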

Chiao-An Yang, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| 3D/4D Video Question Answering | STI-Bench | Accuracy 37.6 | 12 |
| 3D/4D Video Question Answering | MMSI-Bench | Accuracy 33.3 | 12 |
| 3D/4D Video Question Answering | VLM4D real | Accuracy 52.7 | 11 |
| 4D Understanding | R4D-Bench (All) | Average Score 42.2 | 9 |
| 3D/4D Visual Question Answering | STI Bench 1.0 (test) | Accuracy 59.1 | 8 |
| 3D/4D Video Question Answering | SAT | Accuracy 64.7 | 8 |
| 3D/4D Video Question Answering | OmniSpatial | Accuracy 40.4 | 7 |
| 3D/4D Video Question Answering | VSTI-Bench | Accuracy 59.1 | 5 |
| 3D/4D Visual Question Answering | VLM4D-real 1.0 (test) | Accuracy (MCQ) 53.7 | 4 |
| 3D/4D Visual Question Answering | MMSI Bench 1.0 (test) | Avg Multiple Choice Accuracy 33.3 | 4 |
Showing 10 of 12 rows
