PAGE-4D: VGGT-4D Perception via Disentangled Pose and Geometry Estimation

About

Recent feed-forward 3D models, such as the Visual Geometry Grounded Transformer (VGGT), have shown strong capability in inferring 3D attributes of static scenes. However, because they are typically trained on static datasets, these models often struggle in real-world scenarios involving complex dynamic elements, such as moving humans or deformable objects like umbrellas. To address this limitation, we introduce PAGE-4D, a feed-forward model that extends VGGT to dynamic scenes, enabling camera pose estimation, depth prediction, and point cloud reconstruction, all without post-processing. A central challenge in multi-task 4D reconstruction is the inherent conflict between tasks: accurate camera pose estimation requires suppressing dynamic regions, while geometry reconstruction requires modeling them. To resolve this tension, we propose a dynamics-aware aggregator that disentangles static and dynamic information by predicting a dynamics-aware mask, which suppresses motion cues for pose estimation while amplifying them for geometry reconstruction. Extensive experiments show that PAGE-4D consistently outperforms the original VGGT in dynamic scenarios, achieving superior results in camera pose estimation, monocular and video depth estimation, and dense point map reconstruction. Code and additional demos are available at https://page4d.github.io/, including both the training-and-inference masking variant and the training-only masking variant (identical to the VGGT architecture at inference).

Keywords: VGGT-4D, 4D Perception, Dynamic Scene Reconstruction.
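The suppress-versus-amplify idea in the abstract can be illustrated with a toy sketch. This is not the PAGE-4D implementation: the function names, the additive logit bias, and the `strength` parameter are all assumptions made for illustration. It only shows how one per-token dynamics probability could push attention away from moving regions for a pose branch and toward them for a geometry branch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention_weights(logits, dynamic_prob, strength, suppress):
    """Bias attention logits with a per-token dynamics probability.

    Hypothetical helper, not taken from the paper's code.
    suppress=True  subtracts the bias (toy "pose" branch).
    suppress=False adds the bias (toy "geometry" branch).
    """
    sign = -1.0 if suppress else 1.0
    biased = [l + sign * strength * p for l, p in zip(logits, dynamic_prob)]
    return softmax(biased)


# Four image tokens with equal raw attention; the last two are assumed dynamic.
logits = [0.0, 0.0, 0.0, 0.0]
dynamic_prob = [0.0, 0.0, 1.0, 1.0]

pose_weights = masked_attention_weights(logits, dynamic_prob, 4.0, suppress=True)
geometry_weights = masked_attention_weights(logits, dynamic_prob, 4.0, suppress=False)
```

In this toy setting the pose branch gives the dynamic tokens less weight than the static ones, and the geometry branch does the opposite, while both weight vectors still sum to one.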

Kaichen Zhou, Yuhan Wang, Grace Chen, Xinhai Chang, Gaspard Beaudouin, Fangneng Zhan, Paul Pu Liang, Mengyu Wang • 2025

Related benchmarks

Task	Dataset	Metric	Value	
Video Depth Estimation	Sintel	Delta Threshold Accuracy (1.25)	76.3	235
Camera pose estimation	Sintel	ATE	0.143	203
Video Depth Estimation	BONN	AbsRel	5.3	139
Camera pose estimation	TUM	ATE	0.016	65
Point Cloud Reconstruction	DyCheck	Accuracy (Mean)	0.403	40
Video Depth Estimation	DyCheck	Absolute Relative Error	0.141	17
Video Depth Estimation	KITTI	FPS	43.2	8
Novel View Synthesis	Nerfie dvd	PSNR	18.355	5
Novel View Synthesis	Nerfie (hand8)	PSNR	18.047	5
Novel View Synthesis	Nerfie tomato-mark8	PSNR	18.511	5

Showing 10 of 13 rows
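For readers unfamiliar with the depth metrics in the table, the sketch below computes absolute relative error (AbsRel) and delta threshold accuracy at 1.25 using their standard definitions. The helper names and sample depths are made up for illustration. The sketch returns fractions, whereas some rows above appear to be reported as percentages, and it omits any scale alignment that a benchmark protocol may apply before scoring.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels where max(pred/gt, gt/pred) < threshold."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0 and p > 0]
    hits = sum(1 for p, g in pairs if max(p / g, g / p) < threshold)
    return hits / len(pairs)


# Made-up depths in metres for four pixels (illustrative only).
predicted = [1.0, 2.2, 3.0, 8.0]
ground_truth = [1.0, 2.0, 4.0, 4.0]

error = abs_rel(predicted, ground_truth)
accuracy = delta_accuracy(predicted, ground_truth)
```

Lower AbsRel is better and higher delta accuracy is better. Here two of the four pixels fall within the 1.25 ratio, so `accuracy` is 0.5.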
