MIGE: Mutually Enhanced Multimodal Instruction-Based Image Generation and Editing

About

Despite significant progress in diffusion-based image generation, subject-driven generation and instruction-based editing remain challenging. Existing methods typically treat them separately, struggling with limited high-quality data and poor generalization. However, both tasks require capturing complex visual variations while maintaining consistency between inputs and outputs. Inspired by this, we propose MIGE, a unified framework that standardizes task representations using multimodal instructions. It first treats subject-driven generation as creation on a blank canvas and instruction-based editing as modification of an existing image, establishing a shared input-output formulation. It then introduces a novel multimodal encoder that maps free-form multimodal instructions into a unified vision-language space, integrating visual and semantic features through a feature fusion mechanism. This unification enables joint training of both tasks, providing two key advantages: (1) Cross-Task Enhancement: by leveraging shared visual and semantic representations, joint training improves instruction adherence and visual consistency in both subject-driven generation and instruction-based editing. (2) Generalization: learning in a unified format facilitates cross-task knowledge transfer, enabling MIGE to generalize to novel compositional tasks, including instruction-based subject-driven editing. Experiments show that MIGE excels in both subject-driven generation and instruction-based editing while setting a new state of the art in the new task of instruction-based subject-driven editing. Code and model are publicly available at https://github.com/Eureka-Maggie/MIGE.
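The shared input-output formulation described above can be illustrated with a toy sketch. This is not the authors' code: the names `UnifiedSample`, `make_sample`, and `blank_canvas`, the `<image>` placeholder convention, and the all-zero canvas are hypothetical choices made only for illustration. The point is that generation and editing differ only in whether the image to be modified is blank or real.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Toy grayscale image: a list of rows of pixel values (assumption for illustration).
Image = List[List[int]]


def blank_canvas(height: int, width: int) -> Image:
    """Return an all-zero image standing in for the 'blank canvas'."""
    return [[0] * width for _ in range(height)]


@dataclass
class UnifiedSample:
    instruction: str               # text, possibly with <image> placeholders (hypothetical format)
    reference_images: List[Image]  # subject images referred to by the instruction
    source_image: Image            # image to modify; blank for pure generation
    task: str                      # label derived from which inputs are present


def make_sample(
    instruction: str,
    reference_images: List[Image],
    source_image: Optional[Image] = None,
    size: Tuple[int, int] = (4, 4),
) -> UnifiedSample:
    """Map all three task types onto one (instruction, references, source) record."""
    if source_image is None:
        # Subject-driven generation: "creation on a blank canvas".
        return UnifiedSample(
            instruction, reference_images, blank_canvas(*size),
            "subject-driven generation",
        )
    # An existing image is being modified; subject references make it the
    # compositional task named in the abstract.
    task = (
        "instruction-based subject-driven editing"
        if reference_images
        else "instruction-based editing"
    )
    return UnifiedSample(instruction, reference_images, source_image, task)
```

Because every task yields the same record type, a single model could in principle consume all three, which is the property the abstract credits for joint training and cross-task transfer.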

Xueyun Tian, Wei Li, Bingbing Xu, Yige Yuan, Yuanzhuo Wang, Huawei Shen • 2025

Related benchmarks

Task                             Dataset                       Result                       Rank
Instructive image editing        EMU Edit (test)               CLIP Image Similarity 0.862  83
Instructive image editing        MagicBrush (test)             CLIP Image 0.902             53
Instruction-based Image Editing  Reason50K Temporal Reasoning  CLIP Score 0.253             16
Instruction-based Image Editing  Reason50K Story Reasoning     CLIP Score 21.6              16
Instruction-based Image Editing  Reason50K Total               CLIP Score 0.202             16
Instruction-based Image Editing  Reason50K Physical Reasoning  CLIP Score 0.159             16
Instruction-based Image Editing  Reason50K Causal Reasoning    CLIP Score 0.178             16
