
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think

About

Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, its high computational demands, caused by multi-step inference, limited its use in many scenarios. In this paper, we show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed. The fixed model performs comparably to the best previously reported configuration while being more than 200× faster. To optimize for downstream task performance, we perform end-to-end fine-tuning on top of the single-step model with task-specific losses and obtain a deterministic model that outperforms all other diffusion-based depth and normal estimation models on common zero-shot benchmarks. Surprisingly, we find that this fine-tuning protocol also works directly on Stable Diffusion and achieves performance comparable to current state-of-the-art diffusion-based depth and normal estimation models, calling into question some of the conclusions drawn from prior works.
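The single-step, deterministic inference described above can be illustrated with a toy sketch. Assuming the standard DDPM forward relation z_t = sqrt(ᾱ_t)·z_0 + sqrt(1−ᾱ_t)·ε, a single denoiser call at the final timestep suffices to recover a clean latent via x0-prediction. The names `eps_theta`, `alpha_bar_T`, and the stand-in denoiser below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

# Toy stand-in for the latent denoiser eps_theta(z_t, t, image_latent).
# In the real pipeline this would be a U-Net conditioned on the RGB latents.
def eps_theta(z_t, t, image_latent):
    return 0.5 * (z_t - image_latent)  # arbitrary deterministic toy predictor

def single_step_depth(image_latent, alpha_bar_T=1e-4):
    """Single-step x0-prediction: run the denoiser once at the final
    timestep and invert the DDPM forward relation to get a clean latent."""
    # Deterministic start: no sampled noise, so the output is reproducible.
    z_T = np.zeros_like(image_latent)
    eps = eps_theta(z_T, t=999, image_latent=image_latent)
    # z_T = sqrt(a)*z_0 + sqrt(1-a)*eps  =>  solve for z_0.
    z_0 = (z_T - np.sqrt(1.0 - alpha_bar_T) * eps) / np.sqrt(alpha_bar_T)
    return z_0
```

Because no noise is sampled, two calls on the same input produce identical outputs, which is what makes the fine-tuned model deterministic.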

Gonzalo Martin Garcia, Karim Knaebel, Christian Schmidt, Daan de Geus, Alexander Hermans, Bastian Leibe • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Monocular Depth Estimation | ETH3D | AbsRel | 6.2 | 117 |
| Surface Normal Prediction | NYU V2 | Mean Error | 16.5 | 100 |
| Depth Estimation | ScanNet | AbsRel | 0.058 | 94 |
| Monocular Depth Estimation | DIODE | AbsRel | 30.2 | 93 |
| Monocular Depth Estimation | KITTI Improved GT (Eigen) | AbsRel | 0.096 | 92 |
| Depth Estimation | KITTI | AbsRel | 0.096 | 92 |
| Monocular Depth Estimation | ScanNet | AbsRel | 5.8 | 64 |
| Depth Estimation | DIODE | Delta-1 Accuracy | 77.6 | 62 |
| Monocular Depth Estimation | NYU | AbsRel | 5.2 | 21 |
| Depth Estimation | NYU | AbsRel | 0.054 | 20 |

Showing 10 of 33 rows
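The two metrics in the table have standard definitions: AbsRel is the mean absolute relative depth error, and Delta-1 accuracy is the fraction of pixels whose prediction/ground-truth ratio is within 1.25. (Note that different leaderboards report AbsRel either as a ratio or as a percentage, which is why the table mixes values like 0.058 and 5.8.) A minimal sketch of these conventional definitions:

```python
import numpy as np

def abs_rel(pred, gt):
    """Absolute relative error: mean(|pred - gt| / gt) over valid pixels."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    valid = gt > 0  # ignore pixels without ground-truth depth
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

def delta1(pred, gt, thresh=1.25):
    """Delta-1 accuracy: fraction of pixels with max(pred/gt, gt/pred) < 1.25."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    valid = gt > 0
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float(np.mean(ratio < thresh))
```

For example, `abs_rel([1, 2, 4], [1, 2, 2])` averages the per-pixel errors 0, 0, and 1, giving 1/3.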
