Scaling Properties of Diffusion Models for Perceptual Tasks
About
In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm not only for generation but also for visual perception tasks. We unify tasks such as depth estimation, optical flow, and amodal segmentation under the framework of image-to-image translation, and show how diffusion models benefit from scaling both training and test-time compute on these perceptual tasks. Through a careful analysis of these scaling properties, we formulate compute-optimal training and inference recipes for scaling diffusion models on visual perception tasks. Our models achieve performance competitive with state-of-the-art methods while using significantly less data and compute. To access our code and models, see https://scaling-diffusion-perception.github.io.
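As a rough illustration of this image-to-image framing (a minimal sketch, not the released code or the paper's actual pipeline), the snippet below uses plain PyTorch with a toy denoiser, `TinyDenoiser`, a hypothetical stand-in for a real network. The model predicts noise on a target map such as depth, conditioned on the RGB input by channel concatenation, and test-time compute scales along two axes: the number of denoising steps and the number of ensembled samples that are averaged. All names and hyperparameters here are illustrative assumptions.

```python
import torch

# Hypothetical denoiser: predicts the noise on the target map (e.g. depth),
# conditioned on the input RGB image via channel concatenation.
class TinyDenoiser(torch.nn.Module):
    def __init__(self, channels=4):  # 3 RGB channels + 1 depth channel
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Conv2d(channels, 32, 3, padding=1),
            torch.nn.SiLU(),
            torch.nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, depth_t, rgb, t):
        # A real model would embed the timestep t; omitted for brevity.
        return self.net(torch.cat([depth_t, rgb], dim=1))

@torch.no_grad()
def sample_depth(model, rgb, steps=50, ensemble=4):
    """DDPM-style ancestral sampling. Test-time compute grows with
    `steps` (denoising iterations) and `ensemble` (averaged samples)."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    b, _, h, w = rgb.shape
    preds = []
    for _ in range(ensemble):
        x = torch.randn(b, 1, h, w)  # start each sample from pure noise
        for i in reversed(range(steps)):
            eps = model(x, rgb, i)
            # Posterior mean of the reverse step (standard DDPM update).
            coef = betas[i] / torch.sqrt(1.0 - alpha_bars[i])
            x = (x - coef * eps) / torch.sqrt(alphas[i])
            if i > 0:
                x = x + torch.sqrt(betas[i]) * torch.randn_like(x)
        preds.append(x)
    return torch.stack(preds).mean(dim=0)  # ensemble average

rgb = torch.rand(1, 3, 64, 64)
depth = sample_depth(TinyDenoiser(), rgb, steps=25, ensemble=2)
print(depth.shape)  # torch.Size([1, 1, 64, 64])
```

Averaging several independently drawn samples is one simple way to spend extra inference compute on a perception task; increasing the number of denoising steps is the other.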
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Depth Estimation | NYU Depth V2 | -- | -- | 177 |
| Depth Estimation | ScanNet | AbsRel | 7.7 | 94 |
| Depth Estimation | DIODE | Delta-1 Accuracy | 77.2 | 62 |
| Depth Prediction | ETH3D | AbsRel | 4.8 | 35 |
| Optical Flow Prediction | FlyingChairs (val) | EPE | 3.08 | 11 |
| Amodal Segmentation | COCO-A (test) | mIoU | 82.9 | 6 |
| Metric Depth Estimation | Hypersim | AbsRel | 13.6 | 4 |
| Amodal Segmentation | MP3D (test) | mIoU | 63.9 | 2 |
| Amodal Segmentation | P2G (test) | mIoU | 88.6 | 2 |