Exploiting Diffusion Prior for Real-World Image Super-Resolution
About
We present a novel approach to leverage prior knowledge encapsulated in pre-trained text-to-image diffusion models for blind super-resolution (SR). Specifically, by employing our time-aware encoder, we can achieve promising restoration results without altering the pre-trained synthesis model, thereby preserving the generative prior and minimizing training cost. To remedy the loss of fidelity caused by the inherent stochasticity of diffusion models, we employ a controllable feature wrapping module that allows users to balance quality and fidelity by simply adjusting a scalar value during inference. Moreover, we develop a progressive aggregation sampling strategy to overcome the fixed-size constraints of pre-trained diffusion models, enabling adaptation to arbitrary resolutions. A comprehensive evaluation of our method on both synthetic and real-world benchmarks demonstrates its superiority over current state-of-the-art approaches. Code and models are available at https://github.com/IceClear/StableSR.
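The two inference-time ideas above can be illustrated with a minimal NumPy sketch. The first function is a simplified stand-in for the controllable blend (the actual module in the paper learns a fusion of encoder and decoder features; here a plain linear interpolation stands in for it). The second shows the core of progressive aggregation sampling: process overlapping fixed-size tiles and blend the outputs with Gaussian weights so tile seams vanish. All function names, the tile/stride values, and the Gaussian sigma are illustrative assumptions, not the repository's API.

```python
import numpy as np

def cfw_blend(f_dec, f_fidelity, w):
    """Illustrative stand-in for the controllable blend: w = 0 keeps the
    generative (decoder) features; larger w favors fidelity features."""
    return (1.0 - w) * f_dec + w * f_fidelity

def gaussian_weights(tile_h, tile_w, sigma=0.3):
    """2D Gaussian weight map that down-weights tile borders so that
    overlapping tiles blend smoothly at the seams."""
    ys = np.linspace(-1.0, 1.0, tile_h)
    xs = np.linspace(-1.0, 1.0, tile_w)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))

def aggregate_tiles(img, process, tile=64, stride=32):
    """Run `process` (e.g. one diffusion sampling pass at the model's fixed
    input size) on overlapping tiles, then normalize by accumulated weights."""
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.float64)
    acc = np.zeros((h, w), dtype=np.float64)
    wmap = gaussian_weights(tile, tile)
    # Assumes (h - tile) and (w - tile) are divisible by stride so the
    # tile grid covers the image exactly; a real implementation pads edges.
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            patch = process(img[y:y + tile, x:x + tile])
            out[y:y + tile, x:x + tile] += patch * wmap
            acc[y:y + tile, x:x + tile] += wmap
    return out / acc
```

With an identity `process`, `aggregate_tiles` reconstructs the input exactly, which confirms that the Gaussian-weighted normalization is seam-free; any fixed-size restoration model can then be dropped in as `process` to handle images larger than its training resolution.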
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Object Detection | COCO 2017 (val) | -- | -- | 2454 |
| Instance Segmentation | COCO 2017 (val) | APm | 0.146 | 1144 |
| Semantic Segmentation | ADE20K | mIoU | 19.6 | 936 |
| Image Super-Resolution | DRealSR | MANIQA | 0.5601 | 78 |
| Image Super-Resolution | RealSR | PSNR | 26.27 | 71 |
| Super-Resolution | Urban bicubic downsampling (test) | PSNR | 20.52 | 60 |
| Super-Resolution | DIV2K bicubic downsampling (test) | PSNR | 21.69 | 60 |
| Image Super-Resolution | DIV2K (val) | LPIPS | 0.3113 | 59 |
| Video Super-Resolution | REDS4 (val) | Average PSNR | 24.79 | 41 |
| Real-World Super-Resolution | DIV2K real-world degradation (test) | PSNR | 20.41 | 36 |