
Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing

About

In recent years, integrating multimodal understanding and generation into a single unified model has emerged as a promising paradigm. While this approach achieves strong results in text-to-image (T2I) generation, it still struggles with precise image editing. We attribute this limitation to an imbalanced division of responsibilities. The understanding module primarily functions as a translator that encodes user instructions into semantic conditions, while the generation module must simultaneously act as designer and painter, inferring the original layout, identifying the target editing region, and rendering the new content. This imbalance is counterintuitive because the understanding module is typically trained with several times more data on complex reasoning tasks than the generation module. To address this issue, we introduce Draw-In-Mind (DIM), a dataset comprising two complementary subsets: (i) DIM-T2I, containing 14M long-context image-text pairs to enhance complex instruction comprehension; and (ii) DIM-Edit, consisting of 233K chain-of-thought imaginations generated by GPT-4o, serving as explicit design blueprints for image edits. We connect a frozen Qwen2.5-VL-3B with a trainable SANA1.5-1.6B via a lightweight two-layer MLP, and train it on the proposed DIM dataset, resulting in DIM-4.6B-T2I/Edit. Despite its modest parameter scale, DIM-4.6B-Edit achieves SOTA or competitive performance on the ImgEdit and GEdit-Bench benchmarks, outperforming much larger models such as UniWorld-V1 and Step1X-Edit. These findings demonstrate that explicitly assigning the design responsibility to the understanding module provides significant benefits for image editing. Our dataset and models are available at https://github.com/showlab/DIM.
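The wiring the abstract describes — a frozen understanding backbone (Qwen2.5-VL-3B) feeding a trainable generator (SANA1.5-1.6B) through a lightweight two-layer MLP — can be sketched in miniature. The dimensions, the ReLU activation, and all names below are illustrative assumptions for the bridging idea only, not the paper's actual configuration; the real models are large neural networks, not toy matrices.

```python
import random

class TwoLayerMLPConnector:
    """Hypothetical sketch: project understanding-module hidden states
    (dim d_in) into the generator's conditioning space (dim d_out).
    The actual DIM connector's sizes and activation are not given here."""

    def __init__(self, d_in, d_hidden, d_out, seed=0):
        rng = random.Random(seed)
        # Small random weights; in practice these are the only trainable
        # parameters between the frozen VLM and the generator.
        self.w1 = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_hidden)]
        self.w2 = [[rng.gauss(0, 0.02) for _ in range(d_hidden)] for _ in range(d_out)]

    @staticmethod
    def _matvec(w, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

    def forward(self, x):
        # ReLU between the two linear layers (an assumption for this sketch).
        h = [max(0.0, v) for v in self._matvec(self.w1, x)]
        return self._matvec(self.w2, h)

# The understanding module stays frozen: per the abstract, only the
# connector and the generator receive gradient updates during training.
connector = TwoLayerMLPConnector(d_in=8, d_hidden=16, d_out=4)
cond = connector.forward([0.1] * 8)
print(len(cond))  # one conditioning vector of the generator's width (4 here)
```

The design point the abstract makes is that, with DIM-Edit's chain-of-thought "design blueprints", the planning burden moves to the left of this connector: the frozen understanding module emits an explicit edit plan, so the generator only has to paint.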

Ziyun Zeng, David Junhao Zhang, Wei Li, Mike Zheng Shou• 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Text-to-Image Generation | GenEval | Overall Score | 77 | 391 |
| Text-to-Image Generation | MJHQ-30K | Overall FID | 5.5 | 153 |
| Multimodal Understanding | MM-VET (test) | Total Score | 63.2 | 120 |
| Multimodal Understanding | MMMU (test) | MMMU Score | 53.1 | 112 |
| Multimodal Understanding | MMBench (test) | -- | -- | 67 |
| Image Editing | GEdit-Bench-EN (full) | G-Score (O) | 6.18 | 66 |
| Multimodal Understanding | MME Perception (test) | Perception Score | 1570 | 31 |
| Multimodal Understanding | SEED Benchmark (test) | Avg Score (All) | 73.8 | 15 |
| Image Editing | GEdit-Bench-EN Intersection | SC Score | 6.91 | 10 |
| Image Editing | MagicBrush (test) | L1 Error | 0.065 | 9 |

Showing 10 of 11 rows.
