Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching
About
We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation via task-specific low-rank adapters, avoiding objective interference and representation entanglement, while a novel reference-based multimodal preference alignment optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. UniDFlow achieves state-of-the-art performance across eight benchmarks and exhibits strong zero-shot generalization to tasks including inpainting, in-context image generation, reference-based editing, and compositional generation, despite no explicit task-specific training.
Onkar Susladkar, Tushar Prakash, Gayatri Deshmukh, Kiet A. Nguyen, Jiaxun Zhang, Adheesh Juvekar, Tianshu Bao, Lin Chai, Sparsh Mittal, Inderjit S Dhillon, Ismini Lourentzou • 2026
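The abstract's idea of decoupling understanding and generation through task-specific low-rank adapters can be illustrated with a small sketch. This is not the authors' implementation: the class names `LowRankAdapter` and `AdaptedLinear`, the task keys, the dimensions, and the constant initialization are all hypothetical, chosen only to show how a shared frozen weight can carry a separate low-rank update per task so that changing one task's adapter leaves the other task's output untouched.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LowRankAdapter:
    """Hypothetical adapter computing delta(x) = B (A x) with rank r << d."""

    def __init__(self, d_in, d_out, rank):
        # Constant placeholder values; a real adapter would use a random init for A.
        self.A = [[0.1] * d_in for _ in range(rank)]
        # B starts at zero, so a fresh adapter leaves the base output unchanged.
        self.B = [[0.0] * rank for _ in range(d_out)]

    def __call__(self, x):
        return matvec(self.B, matvec(self.A, x))


class AdaptedLinear:
    """Shared (frozen) weight W plus one adapter per task (assumed layout)."""

    def __init__(self, W, adapters):
        self.W = W
        self.adapters = adapters

    def forward(self, x, task):
        base = matvec(self.W, x)
        delta = self.adapters[task](x)
        return [b + d for b, d in zip(base, delta)]


d = 4
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # identity, for clarity
layer = AdaptedLinear(
    W,
    {
        "understanding": LowRankAdapter(d, d, rank=2),
        "generation": LowRankAdapter(d, d, rank=2),
    },
)

x = [1.0, 0.5, -0.5, 2.0]

# Simulate an update that touches only the generation adapter.
layer.adapters["generation"].B[0][0] = 1.0

y_und = layer.forward(x, "understanding")
y_gen = layer.forward(x, "generation")
print(y_und)
print(y_gen)
```

In this toy setup the understanding path still returns the base output, while only the generation path shifts, which is the kind of isolation between objectives that the abstract attributes to separate adapters.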
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | GenEval | Overall Score | 95 | 467 |
| Mathematical Reasoning | MathVista (testmini) | Accuracy | 85.9 | 51 |
| Instructive image editing | EMU Edit (test) | CLIP Image Similarity | 0.921 | 46 |
| Visual Reasoning | MM-Vet | Score | 82.7 | 34 |
| Text-to-Image Generation | DPGBench | DPGBench Score | 91.19 | 31 |
| Multi-discipline Reasoning | MMMU standard (test) | MMMU Score | 74.3 | 14 |
| Multimodal Understanding | MMBench v1.1 (dev) | MMBench Score | 91.2 | 14 |
| Visual Text Reasoning and Recognition | OCRBench v2 | Recognition Accuracy | 76.7 | 14 |
| Multimodal Understanding | MME standard (test) | MME-P Score | 1.80e+3 | 7 |
| Text-to-Image Editing | ImgEdit (test) | Add Score | 4.66 | 7 |
Showing 10 of 12 rows