
TrajVG: 3D Trajectory-Coupled Visual Geometry Learning

About

Feed-forward multi-frame 3D reconstruction models often degrade on videos with object motion: a global reference frame becomes ambiguous under multiple independent motions, while local pointmaps rely heavily on estimated relative poses and can drift, causing cross-frame misalignment and duplicated structures. We propose TrajVG, a reconstruction framework that makes cross-frame 3D correspondence an explicit prediction by estimating camera-coordinate 3D trajectories. We couple sparse trajectories, per-frame local pointmaps, and relative camera poses with geometric consistency objectives: (i) bidirectional trajectory-pointmap consistency with controlled gradient flow, and (ii) a pose consistency objective driven by static track anchors that suppresses gradients from dynamic regions. To scale training to in-the-wild videos, where 3D trajectory labels are scarce, we reformulate the same coupling constraints as self-supervised objectives using only pseudo 2D tracks, enabling unified training with mixed supervision. Extensive experiments on 3D tracking, pose estimation, pointmap reconstruction, and video depth show that TrajVG surpasses current feed-forward baselines.
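As a rough illustration only (not the paper's implementation), the two coupling objectives can be sketched in NumPy. All shapes, function names, and the L1 form of the losses are assumptions; in the actual model, "controlled gradient flow" and the suppression of dynamic-region gradients would be realized with stop-gradient/detach operations and learned masks, which plain NumPy does not express:

```python
import numpy as np


def trajectory_pointmap_consistency(traj, pts_at_tracks):
    """Bidirectional consistency between predicted 3D trajectories and
    the local pointmap sampled at the same 2D track locations.

    traj:          (T, N, 3) camera-coordinate 3D trajectories
    pts_at_tracks: (T, N, 3) pointmap values at the track pixels

    During training each direction would detach one side to control
    gradient flow; here we just compute the symmetric L1 value.
    """
    return np.abs(traj - pts_at_tracks).mean()


def pose_consistency(traj, poses, static):
    """Pose consistency driven by static track anchors.

    Tracks flagged static, warped by the predicted relative pose from
    frame t to frame 0, should land on their frame-0 positions.
    Dynamic tracks are masked out, suppressing their contribution.

    poses:  (T, 4, 4) relative camera poses (frame t -> frame 0)
    static: (N,) boolean mask of tracks judged static
    """
    T = traj.shape[0]
    ref = traj[0]                        # (N, 3) anchors in frame 0
    err = 0.0
    for t in range(1, T):
        R, tr = poses[t, :3, :3], poses[t, :3, 3]
        warped = traj[t] @ R.T + tr      # (N, 3) in frame-0 coords
        err += np.abs(warped[static] - ref[static]).mean()
    return err / (T - 1)
```

For a static scene with perfect poses both losses vanish; any trajectory-pointmap disagreement or pose-inconsistent static anchor raises them, which is what couples the three predictions.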

Xingyu Miao, Weiguang Zhao, Tao Lu, Linning Xu, Mulin Yu, Yang Long, Jiangmiao Pang, Junting Dong • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Monocular Depth Estimation | KITTI | Abs Rel | 0.058 | 161 |
| Monocular Depth Estimation | NYU V2 | -- | -- | 113 |
| Video Depth Estimation | Sintel | Relative Error (Rel) | 0.188 | 109 |
| Video Depth Estimation | BONN | Relative Error (Rel) | 0.036 | 103 |
| Camera pose estimation | Sintel | ATE | 0.108 | 92 |
| Camera pose estimation | ScanNet | ATE RMSE (Avg.) | 0.03 | 61 |
| Camera pose estimation | TUM dynamics | RRE | 0.307 | 57 |
| Video Depth Estimation | KITTI | Abs Rel | 0.037 | 47 |
| 3D Reconstruction | Neural RGB-D (NRGBD) | Acc Mean | 0.029 | 38 |
| Monocular Depth Estimation | Sintel | Abs Rel | 0.297 | 21 |
Showing 10 of 18 rows
