
Guiding Instruction-based Image Editing via Multimodal Large Language Models

About

Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation via LMs. We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive instructions and provides explicit guidance. The editing model jointly captures this visual imagination and performs manipulation through end-to-end training. We evaluate various aspects of Photoshop-style modification, global photo optimization, and local editing. Extensive experimental results demonstrate that expressive instructions are crucial to instruction-based image editing, and our MGIE can lead to a notable improvement in automatic metrics and human evaluation while maintaining competitive inference efficiency.
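The abstract describes a two-stage flow: an MLLM first rewrites a brief command into an expressive instruction, and an editing model then performs the manipulation guided by it. A minimal sketch of that flow is below; the `mllm` and `editor` callables and both function names are hypothetical stand-ins for illustration, not the authors' API (in MGIE the two stages are trained jointly end-to-end, which this toy sketch does not capture).

```python
# Hypothetical sketch of an MGIE-style pipeline (not the authors' code).

def expand_instruction(mllm, image, brief_instruction):
    """Stage 1: the MLLM derives an expressive instruction from a brief one."""
    return mllm(image, brief_instruction)

def edit_image(editor, image, expressive_instruction):
    """Stage 2: the editing model manipulates the image under that guidance."""
    return editor(image, expressive_instruction)

# Stub models standing in for the real MLLM and diffusion editor.
mllm = lambda img, instr: f"{instr}: brighten the sky and boost its contrast"
editor = lambda img, instr: {"image": img, "applied_instruction": instr}

expressive = expand_instruction(mllm, "photo.png", "make the sky pop")
result = edit_image(editor, "photo.png", expressive)
```

The point of the sketch is only the data flow: the brief instruction never reaches the editor directly, it is first expanded into explicit guidance.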

Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, Zhe Gan • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Multimodal Understanding | SEED-Bench | - | - | 203 |
| Image Editing | PIE-Bench | PSNR | 21.2 | 116 |
| Multimodal Benchmarking | MMBench | Score | 6.6 | 62 |
| Instructive image editing | EMU Edit (test) | CLIP Image Similarity | 0.7456 | 46 |
| Multimodal Benchmarking | MMMU | Accuracy | 25.6 | 15 |
| Instruction-guided image editing | MagicBrush single-turn (test) | CLIP Similarity (Image) | 0.7454 | 13 |
| Low-light enhancement | Low-light enhancement dataset | LPIPS | 0.491 | 11 |
| Affective Visual Customization | L-AVC (test) | FID | 0.099 | 10 |
| Description-guided Image Editing | MagicBrush multi-turn (test) | L1 Loss | 0.1912 | 10 |
| Complex instruction-based image editing | CIE-Bench | CLIP-I | 0.7679 | 10 |

Showing 10 of 21 rows.
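Two of the table's metrics are simple pixel-space comparisons: PSNR (higher is better) and L1 loss (lower is better). A minimal sketch in pure Python, assuming images are flattened lists of 8-bit intensity values; the function names and normalization choice are this sketch's own, and benchmark implementations may normalize differently.

```python
from math import log10

def l1_loss(a, b, max_val=255.0):
    """Mean absolute pixel difference, normalized to [0, 1]."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (len(a) * max_val)

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * log10(max_val ** 2 / mse)

reference = [100, 110, 120, 130]
edited = [110, 120, 130, 140]  # every pixel off by 10
print(round(psnr(reference, edited), 2))   # 28.13 dB
print(round(l1_loss(reference, edited), 4))  # 0.0392
```

CLIP-based similarities (CLIP-I, CLIP Image Similarity) instead compare images in a learned embedding space via cosine similarity, so they reward semantic rather than pixel-level agreement.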
