
$\pi^3$: Permutation-Equivariant Visual Geometry Learning

About

We introduce $\pi^3$, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconstructions to a designated viewpoint, an inductive bias that can lead to instability and failures if the reference is suboptimal. In contrast, $\pi^3$ employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames. This design not only makes our model inherently robust to input ordering, but also leads to higher accuracy and performance. These advantages enable our simple and bias-free approach to achieve state-of-the-art performance on a wide range of tasks, including camera pose estimation, monocular/video depth estimation, and dense point map reconstruction. Code and models are available at https://github.com/yyfz/Pi3.
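The core property claimed above is permutation equivariance: reordering the input views reorders the predictions identically, with no privileged reference frame. The hypothetical toy layer below (not the actual $\pi^3$ architecture, just a minimal NumPy sketch) illustrates the property with a shared per-view transform plus a symmetric, mean-pooled global context, and checks that f(Px) == P f(x).

```python
import numpy as np

rng = np.random.default_rng(0)

def equivariant_layer(x, w_local, w_global):
    """Toy permutation-equivariant layer over a set of per-view features.

    x: (n_views, d) array of per-view features. Each view gets the same
    linear map, plus a global context that is order-invariant because it
    is mean-pooled over views; permuting inputs therefore permutes outputs.
    """
    local = x @ w_local                              # shared per-view transform
    ctx = x.mean(axis=0, keepdims=True) @ w_global   # order-invariant context
    return np.tanh(local + ctx)

d = 8
w_local = rng.normal(size=(d, d))
w_global = rng.normal(size=(d, d))
views = rng.normal(size=(5, d))                      # features for 5 input views

perm = rng.permutation(5)
out = equivariant_layer(views, w_local, w_global)
out_perm = equivariant_layer(views[perm], w_local, w_global)

# Permuting the input views permutes the outputs in exactly the same way.
assert np.allclose(out_perm, out[perm])
```

A fixed-reference-view model breaks this check: any operation that singles out view 0 (e.g. anchoring poses to the first camera) makes the output depend on input order, which is the inductive bias $\pi^3$ removes.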

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, Tong He • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Monocular Depth Estimation | KITTI | Abs Rel | 0.06 | 203 |
| Video Depth Estimation | Sintel | Delta Threshold Accuracy (1.25) | 73.2 | 193 |
| Camera pose estimation | Sintel | ATE | 0.073 | 192 |
| Camera pose estimation | TUM-dynamic | ATE | 0.014 | 163 |
| Monocular Depth Estimation | NYU V2 | Delta 1 Acc | 98.6 | 131 |
| Video Depth Estimation | KITTI | Abs Rel | 0.037 | 126 |
| Camera pose estimation | ScanNet | RPE (t) | 0.012 | 119 |
| Video Depth Estimation | BONN | AbsRel | 4.9 | 116 |
| Video Depth Estimation | BONN | Relative Error (Rel) | 0.043 | 103 |
| 3D Reconstruction | 7 Scenes | -- | -- | 94 |

Showing 10 of 190 rows.
