
From Editor to Dense Geometry Estimator

About

Leveraging visual priors from pre-trained text-to-image (T2I) generative models has shown success in dense prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning. Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by "refining" their innate features and ultimately achieve higher performance than their generative counterparts. Based on these findings, we introduce FE2E, a framework that pioneers the adaptation of an advanced editing model, based on the Diffusion Transformer (DiT) architecture, to dense geometry prediction. Specifically, to tailor the editor to this deterministic task, we reformulate the editor's original flow matching loss into a "consistent velocity" training objective. We also use logarithmic quantization to resolve the precision conflict between the editor's native BFloat16 format and the high precision demands of our tasks. Additionally, we leverage the DiT's global attention for cost-free joint estimation of depth and normals in a single forward pass, enabling their supervisory signals to mutually enhance each other. Without scaling up the training data, FE2E achieves impressive performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves over 35% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100× more data. The project page is available at https://amap-ml.github.io/FE2E/.
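To give a feel for the BFloat16 precision conflict mentioned in the abstract, here is a minimal sketch that compares a linear and a logarithmic depth encoding after a simulated BFloat16 round trip. It is not the paper's formulation: the depth range `D_MIN`/`D_MAX`, the mapping to [-1, 1], and all function names are assumptions made for illustration, and BFloat16 is emulated by rounding float32 bit patterns.

```python
import numpy as np

# Assumed working range in metres (illustrative only, not taken from the paper).
D_MIN, D_MAX = 0.1, 80.0


def to_bf16(x):
    """Round values to the nearest BFloat16 value (returned as float32)."""
    x32 = np.atleast_1d(np.asarray(x, dtype=np.float32))
    bits = x32.view(np.uint32)
    # Round-to-nearest-even on the 16 low bits, then drop them.
    bias = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    return ((bits + bias) & np.uint32(0xFFFF0000)).view(np.float32)


def encode_linear(d):
    """Map depth linearly to [-1, 1] (assumed baseline encoding)."""
    return 2.0 * (d - D_MIN) / (D_MAX - D_MIN) - 1.0


def decode_linear(u):
    return (u + 1.0) / 2.0 * (D_MAX - D_MIN) + D_MIN


def encode_log(d):
    """Map log-depth linearly to [-1, 1] (assumed logarithmic encoding)."""
    return 2.0 * np.log(d / D_MIN) / np.log(D_MAX / D_MIN) - 1.0


def decode_log(u):
    return D_MIN * np.exp((u + 1.0) / 2.0 * np.log(D_MAX / D_MIN))


def max_rel_error(encode, decode, depth):
    """Worst-case relative depth error after a BFloat16 round trip."""
    recovered = decode(to_bf16(encode(depth)).astype(np.float64))
    return float(np.max(np.abs(recovered - depth) / depth))


depth = np.linspace(D_MIN, D_MAX, 2000)
lin_err = max_rel_error(encode_linear, decode_linear, depth)
log_err = max_rel_error(encode_log, decode_log, depth)
print(f"max relative error, linear encoding: {lin_err:.4f}")
print(f"max relative error, log encoding:    {log_err:.4f}")
```

Under these assumptions, the linear encoding loses most of its relative precision at close range, where many depths collapse onto the same BFloat16 value near -1, while the logarithmic encoding keeps the relative error roughly uniform across the range.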

JiYuan Wang, Chunyu Lin, Lei Sun, Rongying Liu, Lang Nie, Mingxing Li, Kang Liao, Xiangxiang Chu • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Affine-invariant depth estimation | ETH3D | AbsRel 3.8 | 59 |
| Affine-invariant depth estimation | NYU V2 | AbsRel 4.1 | 59 |
| Affine-invariant depth estimation | ScanNet | AbsRel 4.4 | 58 |
| Affine-invariant depth estimation | KITTI Outdoor | AbsRel 6.6 | 46 |
| Surface Normal Estimation | NYU V2 | Mean Angular Error 16.2 | 33 |
| Affine-invariant depth estimation | DIODE Various | AbsRel 22.8 | 27 |
| Surface Normal Estimation | iBIMS-1 | MAE 15.1 | 17 |
| Affine-invariant depth estimation | Consolidated (KITTI, NYUv2, ETH3D, ScanNet, DIODE) | -- | 16 |
| Surface Normal Estimation | ScanNet Indoor | Mean Error 13.8 | 10 |
| Surface Normal Estimation | Sintel Outdoor | Mean Error (MeanErr) 31.2 | 8 |
Showing 10 of 12 rows
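For readers unfamiliar with the metrics listed above, the following is a minimal sketch of how AbsRel (absolute relative error) and mean angular error are commonly computed. It assumes the usual affine-invariant protocol of fitting a least-squares scale and shift to the prediction before scoring; the function names are hypothetical, and the exact evaluation protocol behind the listed numbers is not stated on this page. AbsRel values such as 3.8 are presumably reported as percentages.

```python
import numpy as np


def align_scale_shift(pred, gt, mask):
    """Fit scale s and shift t so that s * pred + t best matches gt (least squares)."""
    p, g = pred[mask], gt[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * pred + t


def abs_rel(pred, gt, mask):
    """Mean of |pred - gt| / gt over valid pixels."""
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))


def mean_angular_error_deg(pred_n, gt_n):
    """Mean angle in degrees between predicted and ground-truth normals (last axis = xyz)."""
    pred_n = pred_n / np.linalg.norm(pred_n, axis=-1, keepdims=True)
    gt_n = gt_n / np.linalg.norm(gt_n, axis=-1, keepdims=True)
    cos = np.clip(np.sum(pred_n * gt_n, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())


# Usage with synthetic data: a prediction that differs from gt only by an affine map.
rng = np.random.default_rng(0)
gt_depth = rng.uniform(0.5, 10.0, size=(4, 4))
pred_depth = (gt_depth - 1.0) / 2.0
valid = gt_depth > 0
aligned = align_scale_shift(pred_depth, gt_depth, valid)
print("AbsRel after alignment (%):", 100.0 * abs_rel(aligned, gt_depth, valid))
```

Because the alignment removes any global scale and shift, a prediction that is an exact affine transform of the ground truth scores an AbsRel of essentially zero.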
