
TrajVG: 3D Trajectory-Coupled Visual Geometry Learning

About

Feed-forward multi-frame 3D reconstruction models often degrade on videos with object motion: a global reference frame becomes ambiguous under multiple independent motions, while local pointmaps rely heavily on estimated relative poses and can drift, causing cross-frame misalignment and duplicated structures. We propose TrajVG, a reconstruction framework that makes cross-frame 3D correspondence an explicit prediction by estimating camera-coordinate 3D trajectories. We couple sparse trajectories, per-frame local pointmaps, and relative camera poses with geometric consistency objectives: (i) bidirectional trajectory-pointmap consistency with controlled gradient flow, and (ii) a pose consistency objective driven by static track anchors that suppresses gradients from dynamic regions. To scale training to in-the-wild videos, where 3D trajectory labels are scarce, we reformulate the same coupling constraints as self-supervised objectives using only pseudo 2D tracks, enabling unified training with mixed supervision. Extensive experiments on 3D tracking, pose estimation, pointmap reconstruction, and video depth show that TrajVG surpasses current feed-forward baselines.
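As a rough illustration only (not the paper's implementation), the two coupling objectives can be sketched in NumPy. All shapes, function names, and the L1 form of the losses are assumptions; in the actual model, "controlled gradient flow" and the suppression of dynamic-region gradients would be realized with stop-gradient/detach operations and learned masks, which plain NumPy does not express:

```python
import numpy as np


def trajectory_pointmap_consistency(traj, pts_at_tracks):
    """Bidirectional consistency between predicted 3D trajectories and
    the local pointmap sampled at the same 2D track locations.

    traj:          (T, N, 3) camera-coordinate 3D trajectories
    pts_at_tracks: (T, N, 3) pointmap values at the track pixels

    During training each direction would detach one side to control
    gradient flow; here we just compute the symmetric L1 value.
    """
    return np.abs(traj - pts_at_tracks).mean()


def pose_consistency(traj, poses, static):
    """Pose consistency driven by static track anchors.

    Tracks flagged static, warped by the predicted relative pose from
    frame t to frame 0, should land on their frame-0 positions.
    Dynamic tracks are masked out, suppressing their contribution.

    poses:  (T, 4, 4) relative camera poses (frame t -> frame 0)
    static: (N,) boolean mask of tracks judged static
    """
    T = traj.shape[0]
    ref = traj[0]                        # (N, 3) anchors in frame 0
    err = 0.0
    for t in range(1, T):
        R, tr = poses[t, :3, :3], poses[t, :3, 3]
        warped = traj[t] @ R.T + tr      # (N, 3) in frame-0 coords
        err += np.abs(warped[static] - ref[static]).mean()
    return err / (T - 1)
```

For a static scene with perfect poses both losses vanish; any trajectory-pointmap disagreement or pose-inconsistent static anchor raises them, which is what couples the three predictions.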

Xingyu Miao, Weiguang Zhao, Tao Lu, Linning Xu, Mulin Yu, Yang Long, Jiangmiao Pang, Junting Dong • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Monocular Depth Estimation | KITTI | Abs Rel | 0.058 | 161 |
| Monocular Depth Estimation | NYU V2 | -- | -- | 113 |
| Video Depth Estimation | Sintel | Relative Error (Rel) | 0.188 | 109 |
| Video Depth Estimation | BONN | Relative Error (Rel) | 0.036 | 103 |
| Camera pose estimation | Sintel | ATE | 0.108 | 92 |
| Camera pose estimation | ScanNet | ATE RMSE (Avg.) | 0.03 | 61 |
| Camera pose estimation | TUM dynamics | RRE | 0.307 | 57 |
| Video Depth Estimation | KITTI | Abs Rel | 0.037 | 47 |
| 3D Reconstruction | Neural RGB-D (NRGBD) | Acc Mean | 0.029 | 38 |
| Monocular Depth Estimation | Sintel | Abs Rel | 0.297 | 21 |
Showing 10 of 18 rows
