In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer
About
Instruction-based image editing enables precise modifications via natural language prompts, but existing methods face a precision-efficiency tradeoff: fine-tuning demands massive datasets (>10M) and computational resources, while training-free approaches suffer from weak instruction comprehension. We address this by proposing ICEdit, which leverages the inherent comprehension and generation abilities of large-scale Diffusion Transformers (DiTs) through three key innovations: (1) an in-context editing paradigm without architectural modifications; (2) minimal parameter-efficient fine-tuning for quality improvement; (3) Early Filter Inference-Time Scaling, which uses VLMs to select high-quality noise samples for efficiency. Experiments show that ICEdit achieves state-of-the-art editing performance with only 0.1% of the training data and 1% of the trainable parameters compared to previous methods. Our approach establishes a new paradigm for balancing precision and efficiency in instructional image editing. Code and demos can be found at https://river-zhang.github.io/ICEdit-gh-pages/.
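The third innovation, selecting a good initial noise sample before committing to a full generation, can be illustrated with a minimal sketch. The abstract does not specify the selection procedure, so everything below is an assumption: `early_filter_select`, `partial_denoise`, `judge_score`, and `early_steps` are hypothetical names, and the toy stand-ins replace the real DiT sampler and VLM judge. The idea shown is only the general pattern: run a few cheap denoising steps per candidate seed, score each preview, and keep the best seed for the full run.

```python
import random
from typing import Callable, Sequence


def early_filter_select(
    seeds: Sequence[int],
    partial_denoise: Callable[[int, int], object],
    judge_score: Callable[[object], float],
    early_steps: int = 4,
) -> int:
    """Return the seed whose few-step preview gets the highest judge score.

    partial_denoise(seed, steps) and judge_score(preview) are hypothetical
    callables standing in for a diffusion sampler and a VLM-based judge.
    """
    if not seeds:
        raise ValueError("need at least one candidate seed")
    best_seed, best_score = seeds[0], float("-inf")
    for seed in seeds:
        preview = partial_denoise(seed, early_steps)  # cheap, low-step preview
        score = judge_score(preview)                  # higher means better edit
        if score > best_score:
            best_seed, best_score = seed, score
    return best_seed


# Toy stand-ins (assumptions for illustration only, not the paper's models).
def toy_partial_denoise(seed: int, steps: int) -> float:
    # Pretend the "preview" is a single quality number derived from the seed.
    return random.Random(seed).random()


def toy_judge(preview: float) -> float:
    # Pretend the judge simply reads off that quality number.
    return preview


candidate_seeds = [0, 1, 2, 3]
chosen_seed = early_filter_select(candidate_seeds, toy_partial_denoise, toy_judge)
```

In a real pipeline, only `chosen_seed` would then be passed to the full-length sampler, so the cost of the discarded candidates is limited to `early_steps` denoising steps each.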
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Image Editing | ImgEdit-Bench | Overall Score | 3.05 | 132 |
| Image Editing | KRIS-Bench | Factual Knowledge Score | 0.4699 | 65 |
| Image Editing | GEdit-Bench | Semantic Consistency | 4.94 | 46 |
| Instruction-based Image Editing | ImgEdit Bench 1.0 (test) | Add Score | 3.58 | 37 |
| Image-to-Image Translation (Appearance Divergence) | LAION Mini | Structure Similarity | 96.8 | 20 |
| Image-to-Image Translation (Appearance Consistency) | LAION Mini | Structure Similarity | 0.954 | 20 |
| Document Editing | MiLDEBench 1.0 (test) | Instruction Following Score | 2.28 | 18 |
| Single-image editing | GEdit EN (full) | BG Change | 2.73 | 15 |
| Image Editing | ImgEdit (test) | Add Score | 3.58 | 14 |
| Instruction-based Image Editing | EmuEdit-bench (test) | CLIP-src Score | 0.8912 | 13 |