Visual Instruction Inversion: Image Editing via Visual Prompting
About
Text-conditioned image editing has emerged as a powerful tool for editing images. However, in many situations, language can be ambiguous and ineffective at describing specific image edits. In such cases, visual prompts can be a more informative and intuitive way to convey ideas. We present a method for image editing via visual prompting. Given pairs of examples that represent the "before" and "after" images of an edit, our goal is to learn a text-based editing direction that can be used to perform the same edit on new images. We leverage the rich, pretrained editing capabilities of text-to-image diffusion models by inverting visual prompts into editing instructions. Our results show that with just one example pair, we can achieve competitive results compared to state-of-the-art text-conditioned image editing frameworks.
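To make the idea of "inverting a visual prompt into an editing instruction" concrete, the following is a toy sketch only, not the paper's implementation. It assumes a hypothetical frozen editor `edit` that shifts an image vector by a linear function of an instruction vector (standing in for a pretrained diffusion editor), and it learns the instruction by gradient descent from a single before/after pair. The names `edit`, `invert_instruction`, and `W` are illustrative assumptions.

```python
import numpy as np


def edit(image, instruction, W):
    """Hypothetical frozen editor: a linear stand-in for a pretrained model.

    The editor weights W are never updated; only the instruction is learned.
    """
    return image + W @ instruction


def invert_instruction(before, after, W, steps=2000, lr=0.1):
    """Learn an instruction vector so that edit(before, c) approximates after.

    Minimizes the mean squared error with plain gradient descent.
    """
    c = np.zeros(W.shape[1])
    for _ in range(steps):
        residual = edit(before, c, W) - after
        # Gradient of mean(residual ** 2) with respect to c.
        grad = 2.0 * W.T @ residual / residual.size
        c -= lr * grad
    return c


# One example pair is enough in this toy setting.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 4))          # frozen editor weights (assumption)
true_instruction = rng.normal(size=4)  # unknown edit we want to recover
before = rng.normal(size=16)
after = edit(before, true_instruction, W)

learned = invert_instruction(before, after, W)

# Reuse the learned instruction on a new image.
new_image = rng.normal(size=16)
edited_new_image = edit(new_image, learned, W)
```

The design point this mirrors is that the editor stays frozen and only the conditioning is optimized, so the learned instruction can be reused on images that were not part of the example pair.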
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Surface Normal Estimation | Bedroom Images In-domain | L1 Error | 0.2081 | 11 |
| Intrinsic Image Decomposition | Bedroom Images In-domain | Albedo MSE | 0.0145 | 8 |
| Intrinsic Image Decomposition | Bedroom images Out-of-domain | Albedo MSE | 0.0246 | 8 |
| Monocular Depth Estimation | Bedroom Images In-domain | REL | 34.98 | 8 |
| Monocular Depth Estimation | Generalization Images Out-of-domain | Relative Error (REL) | 0.5364 | 8 |
| Surface Normal Estimation | Generalization Images Out-of-domain | L1 Error | 0.2448 | 8 |
| Image Manipulation | Image manipulation Few-shot (In Distribution) | CLIP-Dir | 15.85 | 7 |
| Semantic segmentation | Bedroom dataset | Bed Accuracy | 0.6 | 7 |
| Image Analogy Generation | InstructPix2Pix (test) | CLIP Directional Score | 0.1007 | 6 |
| Image Manipulation | Few-shot image manipulation (Out of Distribution) | CLIP Directional Score | 14.69 | 6 |
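
Several rows above report a CLIP directional score. As a rough sketch of how such a score is commonly defined (the exact protocol and scaling used by these benchmarks is not stated here, so treat this as an assumption), it is the cosine similarity between the change in image embeddings and the change in text embeddings. The function below takes precomputed embedding vectors; obtaining them from a CLIP model is outside the scope of the sketch.

```python
import numpy as np


def clip_directional_similarity(img_before, img_after, txt_before, txt_after, eps=1e-8):
    """Cosine similarity between the image-edit and text-edit directions.

    All inputs are assumed to be precomputed embedding vectors of equal length.
    """
    image_dir = np.asarray(img_after, dtype=float) - np.asarray(img_before, dtype=float)
    text_dir = np.asarray(txt_after, dtype=float) - np.asarray(txt_before, dtype=float)
    # eps guards against division by zero when a direction has zero length.
    denom = max(np.linalg.norm(image_dir) * np.linalg.norm(text_dir), eps)
    return float(image_dir @ text_dir / denom)
```

A score near 1 means the image changed in the direction the text describes, a score near 0 means the two changes are unrelated, and a negative score means the image moved the opposite way.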