MIGE: Mutually Enhanced Multimodal Instruction-Based Image Generation and Editing
About
Despite significant progress in diffusion-based image generation, subject-driven generation and instruction-based editing remain challenging. Existing methods typically treat the two tasks separately and struggle with limited high-quality data and poor generalization. However, both tasks require capturing complex visual variations while maintaining consistency between inputs and outputs. Inspired by this, we propose MIGE, a unified framework that standardizes task representations using multimodal instructions. It treats subject-driven generation as creation on a blank canvas and instruction-based editing as modification of an existing image, establishing a shared input-output formulation. It also introduces a novel multimodal encoder that maps free-form multimodal instructions into a unified vision-language space, integrating visual and semantic features through a feature fusion mechanism. This unification enables joint training of both tasks, providing two key advantages: (1) Cross-Task Enhancement: by leveraging shared visual and semantic representations, joint training improves instruction adherence and visual consistency in both subject-driven generation and instruction-based editing. (2) Generalization: learning in a unified format facilitates cross-task knowledge transfer, enabling MIGE to generalize to novel compositional tasks, including instruction-based subject-driven editing. Experiments show that MIGE excels in both subject-driven generation and instruction-based editing while setting a new state of the art in the new task of instruction-based subject-driven editing. Code and model are publicly available at https://github.com/Eureka-Maggie/MIGE.
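The shared input-output formulation described above can be illustrated with a minimal sketch. This is not the MIGE implementation: the `UnifiedSample` type, the helper names, the `<image>` placeholder token, and the choice of an all-zero grid as the "blank canvas" are assumptions made purely for illustration. The sketch only shows how both tasks could be expressed as "multimodal instruction plus conditioning canvas", so that a single model could be trained on either.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical stand-in for an image: an H x W grid of grayscale values.
Image = List[List[float]]


@dataclass
class UnifiedSample:
    """One training sample in an assumed shared format for both tasks."""
    instruction: str                 # text with "<image>" placeholders (assumed token)
    reference_images: List[Image] = field(default_factory=list)
    canvas: Image = field(default_factory=list)  # blank canvas or source image
    task: str = ""


def blank_canvas(height: int, width: int) -> Image:
    """Assumed representation of a blank canvas: an all-zero grid."""
    return [[0.0] * width for _ in range(height)]


def make_generation_sample(instruction: str, subjects: List[Image],
                           height: int, width: int) -> UnifiedSample:
    """Subject-driven generation: create on a blank canvas."""
    return UnifiedSample(instruction, list(subjects),
                         blank_canvas(height, width),
                         "subject_driven_generation")


def make_editing_sample(instruction: str, source: Image,
                        references: Optional[List[Image]] = None) -> UnifiedSample:
    """Instruction-based editing: modify an existing image."""
    return UnifiedSample(instruction, list(references or []),
                         source, "instruction_based_editing")


# Toy 2 x 2 images for demonstration only.
dog = [[0.2, 0.4], [0.6, 0.8]]
photo = [[0.9, 0.9], [0.1, 0.1]]

gen = make_generation_sample("<image> running on a beach", [dog], 2, 2)
edit = make_editing_sample("make the sky darker", photo)
# Passing a reference image to the editing helper expresses the compositional
# task: instruction-based subject-driven editing.
combo = make_editing_sample("replace the cat with <image>", photo, [dog])
```

Under these assumptions, the only difference between the two tasks is the content of `canvas`, which is what makes a compositional sample such as `combo` expressible without a new format.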
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Instructive image editing | EMU Edit (test) | CLIP Image Similarity | 0.862 | 83 |
| Instructive image editing | MagicBrush (test) | CLIP Image | 0.902 | 53 |
| Instruction-based Image Editing | Reason50K Temporal Reasoning | CLIP Score | 0.253 | 16 |
| Instruction-based Image Editing | Reason50K Story Reasoning | CLIP Score | 21.6 | 16 |
| Instruction-based Image Editing | Reason50K Total | CLIP Score | 0.202 | 16 |
| Instruction-based Image Editing | Reason50K Physical Reasoning | CLIP Score | 0.159 | 16 |
| Instruction-based Image Editing | Reason50K Causal Reasoning | CLIP Score | 0.178 | 16 |
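
CLIP-based image similarity metrics such as those in the table above are commonly computed as the cosine similarity between the CLIP image embeddings of two images, for example an edited output and its target. The exact evaluation protocol behind the listed numbers is not stated here, so the following is only a sketch of that similarity step. It assumes the embeddings have already been extracted, and the function name and toy vectors are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real CLIP embeddings.
edited_embedding = [1.0, 2.0, 3.0]
target_embedding = [2.0, 4.0, 6.0]
score = cosine_similarity(edited_embedding, target_embedding)
```

A benchmark-level figure would typically be the mean of this score over all test pairs. Values fall between -1 and 1, so entries on a different scale may follow a different convention, such as scaling by 100.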