
DreamOmni2: Multimodal Instruction-based Editing and Generation

About

Recent advances in instruction-based image editing and subject-driven generation have attracted significant attention, yet both tasks still fall short of practical user needs. Instruction-based editing relies on language instructions alone, which often fail to capture specific editing details, making reference images necessary. Meanwhile, subject-driven generation is limited to combining concrete objects or people, overlooking broader, abstract concepts. To address these limitations, we propose two new tasks: multimodal instruction-based editing and generation. Both tasks accept text and image instructions and extend their scope to concrete and abstract concepts alike, greatly enhancing their practical applicability. We introduce DreamOmni2, which tackles two primary challenges: data creation and model framework design. Our data synthesis pipeline consists of three steps: (1) using a feature-mixing method to create extraction data for both abstract and concrete concepts; (2) generating multimodal instruction-based editing training data with the editing and extraction models; and (3) further applying the extraction model to create training data for multimodal instruction-based generation. For the framework, we propose an index encoding and position encoding shift scheme to handle multi-image input, which helps the model distinguish between images and avoid pixel confusion. Additionally, we jointly train a VLM with our generation/editing model to better process complex instructions. Finally, we propose comprehensive benchmarks for the two new tasks to drive their development. Experiments show that DreamOmni2 achieves impressive results. Models and code will be released.
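The index encoding and position encoding shift described above can be illustrated with a toy sketch. The idea (as far as the abstract states it) is that tokens from different input images get a per-image index signal plus non-overlapping position ranges, so the model never confuses pixels across images. Everything below — the function name, the shift margin, the random index embeddings — is an illustrative assumption, not the paper's actual implementation:

```python
import numpy as np

def encode_multi_image_tokens(image_shapes, embed_dim=8, shift=1024):
    """Toy sketch of index encoding + position encoding shift.

    Each input image gets (a) its own index embedding and (b) a 1-D
    position range shifted past all previous images, so position ids
    never overlap between images. Values are illustrative only.
    """
    rng = np.random.default_rng(0)
    # Stand-in for learnable per-image index embeddings.
    index_embeds = rng.standard_normal((len(image_shapes), embed_dim))
    positions, indices = [], []
    offset = 0
    for img_idx, (h, w) in enumerate(image_shapes):
        n_tokens = h * w
        # Shift this image's positions past every previous image,
        # plus a fixed margin, so the ranges are clearly separated.
        positions.append(np.arange(offset, offset + n_tokens))
        indices.append(np.full(n_tokens, img_idx))
        offset += n_tokens + shift
    return np.concatenate(positions), np.concatenate(indices), index_embeds

pos, idx, emb = encode_multi_image_tokens([(2, 2), (3, 3)])
# Image 0 occupies positions 0..3; image 1 starts after the shift gap.
```

In this sketch the index embedding tells the model *which* image a token came from, while the shifted positions keep the two images' spatial coordinates from colliding; the actual scheme in DreamOmni2 may differ in both mechanism and scale.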

Bin Xia, Bohao Peng, Yuechen Zhang, Junjia Huang, Jiyang Liu, Jingyao Li, Haoru Tan, Sitong Wu, Chengyao Wang, Yitong Wang, Xinglong Wu, Bei Yu, Jiaya Jia • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Instruction-based Image Editing | ImgEdit Bench 1.0 (test) | Add Score | 3.93 | 37 |
| Multi-image context generation | MICON-Bench | Object Score | 88.95 | 18 |
| Instruction-based Image Editing | EmuEdit-bench (test) | CLIP-src Score | 0.9035 | 13 |
| In-context image generation | OmniContext 1.0 (test) | Single Instance Character Fidelity | 7.36 | 13 |
| Instruction-based Image Editing | GEdit-Bench (test) | CLIP Score (Source) | 92.29 | 12 |
| Multi-Image Composition | MICON-Bench uniformly sampled subset | Object Score | 90 | 10 |
| Unified image generation and editing | ICE-Bench (test) | Aesthetics Score | 5.188 | 9 |
| Object Composition | MICON-Bench | CLIP Score (T2I) | 0.3454 | 9 |
| Spatial Geometric Constraints | MICON-Bench | CLIP Score (T2I) | 0.3647 | 9 |
| Multi-image editing | MMIE-Bench 1.0 (test) | Addition Score | 3.92 | 8 |
Showing 10 of 18 rows
