Step1X-Edit: A Practical Framework for General Image Editing
About
In recent years, image editing models have developed rapidly. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models can fulfill the vast majority of user-driven editing requirements, marking a significant advance in image manipulation. However, there is still a large gap between open-source algorithms and these closed-source models. In this paper, we therefore release a state-of-the-art image editing model, called Step1X-Edit, which provides performance comparable to closed-source models such as GPT-4o and Gemini2 Flash. More specifically, we adopt a multimodal LLM to process the reference image and the user's editing instruction. A latent embedding is extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. For evaluation, we develop GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Editing | ImgEdit-Bench | Overall Score | 3.9 | 224 |
| Image Editing | GEdit-Bench | Semantic Consistency | 7.66 | 102 |
| Image Editing | KRIS-Bench | Overall Score | 51.59 | 98 |
| Image Editing | GEdit-Bench English | G_O (Overall Quality) | 7.48 | 94 |
| Instructional Image Editing | CV-Arena 12K examples 1.0 | Elo Rating | 1.13e+3 | 84 |
| Image Editing | GEdit-Bench-EN (full) | G-Score (O) | 7.24 | 84 |
| Image Editing | ImgEdit | Add Score | 3.88 | 81 |
| Reasoning-informed Image Editing | RISE-Bench | Temporal Score | 0.00 | 64 |
| Image Editing | ImgEdit | ImgEdit | 3.06 | 62 |
| Single-image editing | GEdit EN (full) | BG Change | 7.03 | 42 |