$\pi^3$: Permutation-Equivariant Visual Geometry Learning

About

We introduce $\pi^3$, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconstructions to a designated viewpoint, an inductive bias that can lead to instability and failures if the reference is suboptimal. In contrast, $\pi^3$ employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames. This design not only makes our model inherently robust to input ordering, but also leads to higher accuracy and performance. These advantages enable our simple and bias-free approach to achieve state-of-the-art performance on a wide range of tasks, including camera pose estimation, monocular/video depth estimation, and dense point map reconstruction. Code and models are available at https://github.com/yyfz/Pi3.

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, Tong He• 2025

Related benchmarks

Task	Dataset	Result
Video Depth Estimation	Sintel	Delta Threshold Accuracy (1.25)73.2	235
Monocular Depth Estimation	KITTI	Abs Rel0.06	220
Camera pose estimation	TUM-dynamic	ATE0.014	205
Camera pose estimation	Sintel	ATE0.073	203
Monocular Depth Estimation	NYU V2	Delta 1 Acc98.6	174
Video Depth Estimation	KITTI	Abs Rel0.037	148
Camera pose estimation	ScanNet	RPE (t)0.012	133
Video Depth Estimation	BONN	AbsRel4.9	131
3D Reconstruction	7 Scenes	--	128
Monocular Depth Estimation	Sintel	Abs Rel0.1489	127

Showing 10 of 302 rows

...

Other info

Follow for update

@wizwand_team Discord