Controlling Vision-Language Models for Multi-Task Image Restoration
About
Vision-language models such as CLIP have shown great impact on diverse downstream tasks with zero-shot or label-free predictions. However, their performance deteriorates dramatically on low-level vision tasks such as image restoration, where inputs are corrupted. In this paper, we present a degradation-aware vision-language model (DA-CLIP) that better transfers pretrained vision-language models to low-level vision tasks, serving as a multi-task framework for image restoration. More specifically, DA-CLIP trains an additional controller that adapts the frozen CLIP image encoder to predict high-quality feature embeddings. By integrating these embeddings into an image restoration network via cross-attention, we pilot the model toward high-fidelity image reconstruction. The controller also outputs a degradation feature that matches the actual corruption of the input, yielding a natural classifier for different degradation types. In addition, we construct a mixed-degradation dataset with synthetic captions for DA-CLIP training. Our approach advances state-of-the-art performance on both *degradation-specific* and *unified* image restoration tasks, showing a promising direction for prompting image restoration with large-scale pretrained vision-language models. Our code is available at https://github.com/Algolzw/daclip-uir.
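The cross-attention step described above can be sketched in plain NumPy. This is a minimal illustration, not the repository's implementation: names such as `features` and `embed` are hypothetical, and DA-CLIP's actual controller and restoration network are full neural modules in the linked code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(features, embed, Wq, Wk, Wv):
    """Let restoration-network features attend to the content embedding.

    features: (n, d) flattened spatial features from the restoration net
    embed:    (m, d) content-embedding tokens predicted by the controller
    """
    q = features @ Wq                # queries come from image features
    k = embed @ Wk                   # keys come from the CLIP-derived embedding
    v = embed @ Wv                   # values come from the CLIP-derived embedding
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return features + attn @ v      # residual injection of the embedding

rng = np.random.default_rng(0)
d = 8
feats = rng.standard_normal((16, d))   # e.g. a 4x4 feature map, flattened
embed = rng.standard_normal((1, d))    # a single content-embedding token
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = cross_attention(feats, embed, Wq, Wk, Wv)
print(out.shape)  # (16, 8): features, now conditioned on the embedding
```

With a single embedding token the attention weights collapse to 1, so the update reduces to adding a projected copy of the embedding to every spatial position; with multiple tokens the weights select which parts of the embedding each position draws from.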
Related benchmarks
| Task | Dataset | PSNR (dB) | Rank |
|---|---|---|---|
| Image Deblurring | RealBlur-J (test) | 20.53 | 226 |
| Image Deblurring | GoPro | 26.5 | 221 |
| Image Dehazing | SOTS (test) | 30.12 | 161 |
| Image Deraining | Rain100L (test) | 35.92 | 161 |
| Low-light Image Enhancement | LOL | 24.17 | 122 |
| Dehazing | SOTS | 29.78 | 117 |
| Deraining | Rain100L | 36.28 | 116 |
| Low-light Image Enhancement | LOL v1 | 21.94 | 113 |
| Image Dehazing | SOTS Outdoor | 28.1 | 112 |
| Denoising | BSD68 (sigma=25) | 30.42 | 70 |