UniT: Unified Geometry Learning with Group Autoregressive Transformer

About

Recent feed-forward models have significantly advanced geometry perception for inferring dense 3D structure from sensor observations. However, the essential capabilities of geometry perception remain fragmented across multiple incompatible paradigms, including online perception, offline reconstruction, multi-modal integration, long-horizon scalability, and metric-scale estimation. We present UniT, a unified model built upon a novel Group Autoregressive Transformer, which reformulates these seemingly disparate capabilities within a single framework. The key idea is to treat groups of sensor observations as the basic autoregressive units and predict the corresponding point maps in an anchor-free and scale-adaptive manner. More specifically, diverse view configurations in both online and offline settings are naturally unified within a single group autoregression process. By varying the group size, online mode operates over multiple autoregressive steps with single-frame groups, whereas offline mode aggregates a multi-frame group in a single forward pass. Meanwhile, a queue-style KV caching mechanism ensures bounded autoregressive memory over long horizons. This is enabled by reducing long-range dependencies on early frames through anchor-free relational modeling, thereby allowing outdated memory to be discarded on the fly. To improve metric-scale generalization across scenes, a scale-adaptive geometry loss is further introduced within this framework. It couples relative geometric constraints with a partial absolute scale term, implicitly regularizing global scale and inducing a progressive transition from scale-invariant geometry to metric-scale solutions. Together with a dedicated modal attention module for integrating auxiliary modalities, UniT achieves state-of-the-art performance in unified geometry perception, as validated on ten benchmarks spanning seven representative tasks.
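
The group-size and queue-style caching ideas in the abstract can be illustrated with a small sketch. This is not the authors' implementation: `split_into_groups`, `QueueCache`, `run`, and the `max_groups` parameter are hypothetical names chosen for illustration, and plain Python lists stand in for the per-group key/value tensors a real transformer would cache. It only shows how varying the group size switches between many single-frame steps and one multi-frame pass, and how a fixed-length queue keeps memory bounded by discarding the oldest entries.

```python
from collections import deque
from typing import Deque, List, Sequence, Tuple


def split_into_groups(frames: Sequence[int], group_size: int) -> List[List[int]]:
    """Partition frames into consecutive groups; the last group may be shorter."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [list(frames[i:i + group_size]) for i in range(0, len(frames), group_size)]


class QueueCache:
    """FIFO cache holding at most `max_groups` entries.

    Hypothetical stand-in for a KV cache: each entry here is a list of
    frame ids rather than real key/value tensors.
    """

    def __init__(self, max_groups: int) -> None:
        # deque with maxlen drops the oldest entry automatically when full.
        self._entries: Deque[List[int]] = deque(maxlen=max_groups)

    def push(self, group: List[int]) -> None:
        self._entries.append(group)

    def contents(self) -> List[List[int]]:
        return list(self._entries)


def run(frames: Sequence[int], group_size: int, max_groups: int) -> Tuple[int, List[List[int]]]:
    """Process frames group by group; return (number of steps, final cache contents)."""
    cache = QueueCache(max_groups)
    steps = 0
    for group in split_into_groups(frames, group_size):
        # A real model would attend over cache.contents() plus `group` here.
        cache.push(group)
        steps += 1
    return steps, cache.contents()


frames = list(range(6))
online_steps, online_cache = run(frames, group_size=1, max_groups=3)
offline_steps, offline_cache = run(frames, group_size=len(frames), max_groups=3)
```

With six frames, the single-frame setting takes six steps and retains only the three most recent groups, while the single-group setting takes one step and caches all frames as one entry.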

Haotian Wang, Yusong Huang, Zhaonian Kuang, Hongliang Lu, Xinhu Zheng, Meng Yang, Gang Hua • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Video Depth Estimation | Sintel | Delta Threshold Accuracy (1.25) | 65.4 | 235 |
| Monocular Depth Estimation | KITTI | Abs Rel | 0.061 | 220 |
| Monocular Depth Estimation | NYU V2 | -- | -- | 174 |
| 3D Reconstruction | 7 Scenes | -- | -- | 128 |
| Monocular Depth Estimation | Sintel | Abs Rel | 0.282 | 127 |
| Depth Completion | NYU V2 | RMSE | 0.269 | 44 |
| Multi-view Point Map Estimation | NRGBD | Accuracy Error | 0.04 | 30 |
| Pose Estimation | ScanNet V2 | Avg ATE (cm) | 0.031 | 19 |
| Video Depth Estimation | ETH3D | Relative Error | 3.7 | 18 |
| Camera pose estimation | Sintel | ATE | 0.124 | 13 |
Showing 10 of 15 rows

Other info

GitHub
