EditCLIP: Representation Learning for Image Editing
About
We introduce EditCLIP, a novel representation-learning approach for image editing. Our method learns a unified representation of edits by jointly encoding an input image and its edited counterpart, effectively capturing their transformation. To evaluate its effectiveness, we employ EditCLIP to solve two tasks: exemplar-based image editing and automated edit evaluation. In exemplar-based image editing, we replace text-based instructions in InstructPix2Pix with EditCLIP embeddings computed from a reference exemplar image pair. Experiments demonstrate that our approach outperforms state-of-the-art methods while being more efficient and versatile. For automated evaluation, EditCLIP assesses image edits by measuring the similarity between the EditCLIP embedding of a given image pair and either a textual editing instruction or the EditCLIP embedding of another reference image pair. Experiments show that EditCLIP aligns more closely with human judgments than existing CLIP-based metrics, providing a reliable measure of edit quality and structural preservation.
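The automated-evaluation idea above, scoring an edit by how similar its embedding is to a reference embedding, can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure, as is common for CLIP-style embeddings; the abstract does not specify the measure. The function names `cosine_similarity` and `rank_candidates` and the toy vectors are hypothetical, and the EditCLIP encoder that would produce real embeddings is not shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def rank_candidates(reference_emb, candidate_embs):
    """Rank candidate edit embeddings by similarity to a reference.

    reference_emb stands in for either a text-instruction embedding or
    the embedding of an exemplar (input, edited) image pair.
    Returns (indices sorted best-first, raw scores).
    """
    scores = [cosine_similarity(reference_emb, c) for c in candidate_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy 2-D vectors standing in for real edit embeddings (assumption).
reference = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
order, scores = rank_candidates(reference, candidates)
```

In this sketch a higher score means the candidate edit is closer to the reference transformation; how such scores map to human judgments is an empirical question addressed in the paper, not by this code.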
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Dehazing | SOTS | -- | 154 |
| Super-Resolution | FFHQ 1k | FID 77.64 | 23 |
| Image Denoising | BSD400 (test) | FID 99 | 16 |
| Image Colorization | DIV2K | FID 138.3 | 16 |
| Image Deblurring | FFHQ 1k | FID 78.75 | 16 |
| Image Deraining | Rain100L | FID 174.9 | 13 |
| Super-Resolution | User Study SR samples | Perceptual Score 0.00e+0 | 5 |
| Image Deraining | User Study DeRain samples | Perceptual Score 4.5 | 4 |