Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video

About

This paper presents a unified approach to understanding dynamic scenes from casual videos. Large pretrained vision foundation models, such as vision-language, video depth prediction, motion tracking, and segmentation models, offer promising capabilities. However, training a single model for comprehensive 4D understanding remains challenging. We introduce Uni4D, a multi-stage optimization framework that harnesses multiple pretrained models to advance dynamic 3D modeling, including static/dynamic reconstruction, camera pose estimation, and dense 3D motion tracking. Our results show state-of-the-art performance in dynamic 4D modeling with superior visual quality. Notably, Uni4D requires no retraining or fine-tuning, highlighting the effectiveness of repurposing visual foundation models for 4D understanding.

David Yifan Yao, Albert J. Zhai, Shenlong Wang • 2025

Related benchmarks

Task                    Dataset        Result      Rank
Camera pose estimation  Sintel         ATE 0.116   92
Camera pose estimation  TUM dynamics   RRE 0.434   57
Pose Estimation         BONN           ATE 0.017   10