Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching

About

We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation via task-specific low-rank adapters, avoiding objective interference and representation entanglement. A novel reference-based multimodal preference alignment optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. UniDFlow achieves SOTA performance across eight benchmarks and exhibits strong zero-shot generalization to tasks including inpainting, in-context image generation, reference-based editing, and compositional generation, despite receiving no explicit task-specific training.
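
The idea of task-specific low-rank adapters can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say where adapters attach, what rank is used, or how they are trained, so the class name `TaskLoRALinear`, the task labels, the dimensions, and the zero initialization of `B` (a common LoRA convention) are all assumptions made for illustration. The sketch only shows the general pattern: one shared frozen weight matrix, plus a separate low-rank update per task, so that changing one task's adapter leaves the other task's output untouched.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class TaskLoRALinear:
    """Hypothetical linear layer: frozen shared W plus one low-rank B @ A per task."""

    def __init__(self, d_in, d_out, rank, tasks, seed=0):
        rng = random.Random(seed)
        # Shared weight; treated as frozen in this sketch.
        self.W = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
        # One adapter per task. A is small random, B is zero, so every
        # adapter starts as a no-op (assumed, following common LoRA practice).
        self.adapters = {
            task: {
                "A": [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)],
                "B": [[0.0] * rank for _ in range(d_out)],
            }
            for task in tasks
        }

    def forward(self, x, task):
        base = matvec(self.W, x)
        adapter = self.adapters[task]
        # Low-rank update: project down with A, then back up with B.
        delta = matvec(adapter["B"], matvec(adapter["A"], x))
        return [b + d for b, d in zip(base, delta)]


layer = TaskLoRALinear(d_in=4, d_out=3, rank=2, tasks=["understanding", "generation"])
x = [1.0, 2.0, 3.0, 4.0]

y_und = layer.forward(x, "understanding")
y_gen = layer.forward(x, "generation")

# Simulate an update to the generation adapter only.
layer.adapters["generation"]["B"][0][0] = 1.0

y_und_after = layer.forward(x, "understanding")
y_gen_after = layer.forward(x, "generation")
```

In this toy setup, modifying the generation adapter changes only the generation path, while the understanding path still produces its original output. That isolation is the property the abstract attributes to decoupling the two objectives.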

Onkar Susladkar, Tushar Prakash, Gayatri Deshmukh, Kiet A. Nguyen, Jiaxun Zhang, Adheesh Juvekar, Tianshu Bao, Lin Chai, Sparsh Mittal, Inderjit S. Dhillon, Ismini Lourentzou • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Text-to-Image Generation | GenEval | Overall Score | 95 | 467 |
| Mathematical Reasoning | MathVista (testmini) | Accuracy | 85.9 | 51 |
| Instructive image editing | EMU Edit (test) | CLIP Image Similarity | 0.921 | 46 |
| Visual Reasoning | MM-Vet | Score | 82.7 | 34 |
| Text-to-Image Generation | DPGBench | DPGBench Score | 91.19 | 31 |
| Multi-discipline Reasoning | MMMU standard (test) | MMMU Score | 74.3 | 14 |
| Multimodal Understanding | MMBench v1.1 (dev) | MMBench Score | 91.2 | 14 |
| Visual Text Reasoning and Recognition | OCRBench v2 | Recognition Accuracy | 76.7 | 14 |
| Multimodal Understanding | MME standard (test) | MME-P Score | 1.80e+3 | 7 |
| Text-to-Image Editing | ImgEdit (test) | Add Score | 4.66 | 7 |
Showing 10 of 12 rows
