Step1X-Edit: A Practical Framework for General Image Editing
About
In recent years, image editing models have developed rapidly. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models can fulfill the vast majority of user-driven editing requirements, marking a significant advance in image manipulation. However, a large gap remains between open-source algorithms and these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing model, called Step1X-Edit, which provides performance comparable to closed-source models such as GPT-4o and Gemini2 Flash. More specifically, we adopt a multimodal LLM to process the reference image and the user's editing instruction. A latent embedding is extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. For evaluation, we develop GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Editing | ImgEdit-Bench | Overall Score | 3.9 | 191 |
| Image Editing | GEdit-Bench | Semantic Consistency | 7.66 | 92 |
| Image Editing | GEdit-Bench English | G_O (Overall Quality) | 7.48 | 84 |
| Image Editing | KRIS-Bench | Factual Knowledge Score | 45.52 | 74 |
| Image Editing | GEdit-Bench-EN (full) | G-Score (O) | 6.97 | 66 |
| Single-image editing | GEdit EN (full) | BG Change | 7.03 | 42 |
| Instruction-based Image Editing | ImgEdit Bench 1.0 (test) | Add Score | 3.91 | 37 |
| Combined | Multilingual Benchmark | IA Score | 2.35 | 34 |
| Image Editing | ImgEdit | ImgEdit | 3.06 | 31 |
| Reasoning Image Editing | RiseBench 1.0 (test) | Temporal Score | 0.00 | 30 |