PAGE-4D: VGGT-4D Perception via Disentangled Pose and Geometry Estimation

About

Recent 3D feed-forward models, such as the Visual Geometry Grounded Transformer (VGGT), have shown strong capability in inferring 3D attributes of static scenes. However, because they are typically trained on static datasets, these models often struggle in real-world scenarios involving complex dynamic elements, such as moving humans or deformable objects like umbrellas. To address this limitation, we introduce PAGE-4D, a feed-forward model that extends VGGT to dynamic scenes, enabling camera pose estimation, depth prediction, and point cloud reconstruction, all without post-processing. A central challenge in multi-task 4D reconstruction is the inherent conflict between tasks: accurate camera pose estimation requires suppressing dynamic regions, while geometry reconstruction requires modeling them. To resolve this tension, we propose a dynamics-aware aggregator that disentangles static and dynamic information by predicting a dynamics-aware mask, which suppresses motion cues for pose estimation while amplifying them for geometry reconstruction. Extensive experiments show that PAGE-4D consistently outperforms the original VGGT in dynamic scenarios, achieving superior results in camera pose estimation, monocular and video depth estimation, and dense point map reconstruction. Code and additional demos are available at https://page4d.github.io/.

Keywords: VGGT-4D, 4D Perception, Dynamic Scene Reconstruction.
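
To make the masking idea concrete, here is a minimal toy sketch, not the authors' implementation. The abstract does not say how the mask is applied inside the aggregator, so everything here is an assumption for illustration: the function name `apply_dynamics_mask`, the `gain` parameter, and the choice to scale token features by `(1 - m)` for the pose branch and `(1 + gain * m)` for the geometry branch, where `m` in [0, 1] is a per-token probability of being dynamic.

```python
from typing import List, Sequence, Tuple

Tokens = List[List[float]]


def apply_dynamics_mask(
    tokens: Sequence[Sequence[float]],
    mask: Sequence[float],
    gain: float = 1.0,
) -> Tuple[Tokens, Tokens]:
    """Split one token stream into pose-oriented and geometry-oriented views.

    tokens: per-token feature vectors.
    mask:   per-token dynamic score in [0, 1] (1 = fully dynamic).
    gain:   hypothetical strength of the amplification for geometry.
    """
    if len(tokens) != len(mask):
        raise ValueError("tokens and mask must have the same length")

    pose_tokens: Tokens = []
    geometry_tokens: Tokens = []
    for features, m in zip(tokens, mask):
        if not 0.0 <= m <= 1.0:
            raise ValueError("mask values must lie in [0, 1]")
        # Pose branch: down-weight dynamic tokens so static structure dominates.
        pose_tokens.append([x * (1.0 - m) for x in features])
        # Geometry branch: up-weight dynamic tokens so motion is still modeled.
        geometry_tokens.append([x * (1.0 + gain * m) for x in features])
    return pose_tokens, geometry_tokens


# Toy usage: the first token is static, the second is fully dynamic.
pose, geometry = apply_dynamics_mask([[1.0, 2.0], [3.0, 4.0]], [0.0, 1.0])
```

In this toy version the static token passes through both branches unchanged, while the dynamic token is zeroed for pose and doubled for geometry. A real model would more plausibly apply a learned mask inside its attention layers.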

Kaichen Zhou, Yuhan Wang, Grace Chen, Xinhai Chang, Gaspard Beaudouin, Fangneng Zhan, Paul Pu Liang, Mengyu Wang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Depth Estimation | Sintel | Delta Threshold Accuracy (1.25) | 76.3 | 235 |
| Camera Pose Estimation | Sintel | ATE | 0.143 | 203 |
| Video Depth Estimation | BONN | AbsRel | 5.3 | 131 |
| Camera Pose Estimation | TUM | ATE | 0.016 | 59 |
| Point Cloud Reconstruction | DyCheck | Accuracy (Mean) | 0.403 | 27 |
| Video Depth Estimation | DyCheck | Absolute Relative Error | 0.141 | 17 |
| Video Depth Estimation | KITTI | FPS | 43.2 | 8 |
| Novel View Synthesis | Nerfie dvd | PSNR | 18.355 | 5 |
| Novel View Synthesis | Nerfie (hand8) | PSNR | 18.047 | 5 |
| Novel View Synthesis | Nerfie tomato-mark8 | PSNR | 18.511 | 5 |
Showing 10 of 13 rows
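
For readers unfamiliar with the metric names in the benchmark rows above, the sketch below shows the standard textbook definitions of AbsRel, delta-threshold accuracy, and PSNR in plain Python. The function names are illustrative. The page does not describe the exact evaluation protocol, so any scale alignment, valid-pixel masking, or percentage scaling used by the benchmarks is not reproduced here. Some results (for example 76.3 and 5.3) look like percentages, but that is an inference, not something the page states.

```python
import math
from typing import Sequence


def abs_rel(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Absolute relative depth error: mean(|pred - gt| / gt), with gt > 0."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def delta_accuracy(
    pred: Sequence[float], gt: Sequence[float], threshold: float = 1.25
) -> float:
    """Fraction of pixels where max(pred/gt, gt/pred) is below the threshold."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)


def psnr(image: Sequence[float], reference: Sequence[float], max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    mse = sum((a - b) ** 2 for a, b in zip(image, reference)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


# Toy usage on three depth values (positive depths assumed).
predicted = [1.0, 2.0, 4.0]
ground_truth = [1.0, 2.0, 2.0]
error = abs_rel(predicted, ground_truth)
accuracy = delta_accuracy(predicted, ground_truth)
```

Lower is better for AbsRel and ATE, while higher is better for delta-threshold accuracy, PSNR, and FPS.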
