Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video

About

This paper presents a unified approach to understanding dynamic scenes from casual videos. Large pretrained vision foundation models, such as vision-language, video depth prediction, motion tracking, and segmentation models, offer promising capabilities. However, training a single model for comprehensive 4D understanding remains challenging. We introduce Uni4D, a multi-stage optimization framework that harnesses multiple pretrained models to advance dynamic 3D modeling, including static/dynamic reconstruction, camera pose estimation, and dense 3D motion tracking. Our results show state-of-the-art performance in dynamic 4D modeling with superior visual quality. Notably, Uni4D requires no retraining or fine-tuning, highlighting the effectiveness of repurposing visual foundation models for 4D understanding.

David Yifan Yao, Albert J. Zhai, Shenlong Wang • 2025

Related benchmarks

Task	Dataset	Result
Camera pose estimation	Sintel	ATE 0.116	203
Camera pose estimation	TUM dynamics	ATE 0.039	90
Pose Estimation	BONN	ATE 0.017	38
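
The page does not say how the ATE values above were computed. As a rough illustration only, the sketch below assumes ATE means absolute trajectory error, taken as the RMSE of camera positions after a least-squares similarity (Umeyama) alignment, which is a common convention for monocular pose estimation where global scale is ambiguous. The function names `umeyama_align` and `ate_rmse` are hypothetical and are not taken from the Uni4D code.

```python
import numpy as np


def umeyama_align(est, gt):
    """Least-squares similarity transform (s, R, t) mapping est onto gt.

    est, gt: (N, 3) arrays of camera positions with matching timestamps.
    """
    n = est.shape[0]
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e, g = est - mu_e, gt - mu_g
    cov = g.T @ e / n                      # cross-covariance, target x source
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                     # avoid returning a reflection
    R = U @ S @ Vt
    var_e = (e ** 2).sum() / n             # variance of the source points
    s = np.trace(np.diag(D) @ S) / var_e
    t = mu_g - s * (R @ mu_e)
    return s, R, t


def ate_rmse(est, gt):
    """RMSE of position error after similarity alignment (assumed ATE definition)."""
    s, R, t = umeyama_align(est, gt)
    aligned = s * (R @ est.T).T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))
```

Under this assumed definition, an estimated trajectory that differs from the ground truth only by a global scale, rotation, and translation scores an ATE of essentially zero, so the metric reflects the shape of the camera path rather than the choice of coordinate frame.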
