FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities

About

The rapid progress of large language models (LLMs) has catalyzed the emergence of multimodal large language models (MLLMs) that unify visual understanding and image generation within a single framework. However, most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development, such as the raster-scan order in image generation and restricted reasoning abilities in causal context modeling. In this work, we challenge the dominance of AR-based approaches by introducing FUDOKI, a unified multimodal model purely based on discrete flow matching, as an alternative to conventional AR paradigms. By leveraging metric-induced probability paths with kinetic-optimal velocities, our framework goes beyond the previous masking-based corruption process, enabling iterative refinement with self-correction capability and richer bidirectional context integration during generation. To mitigate the high cost of training from scratch, we initialize FUDOKI from pre-trained AR-based MLLMs and adaptively transition to the discrete flow matching paradigm. Experimental results show that FUDOKI achieves performance comparable to state-of-the-art AR-based MLLMs across both visual understanding and image generation tasks, highlighting its potential as a foundation for next-generation unified multimodal models. Furthermore, we show that applying test-time scaling techniques to FUDOKI yields significant performance gains, further underscoring its promise for future enhancement through reinforcement learning.

Jin Wang, Yao Lai, Aoxue Li, Shifeng Zhang, Jiacheng Sun, Ning Kang, Chengyue Wu, Zhenguo Li, Ping Luo • 2025

Related benchmarks

Task	Dataset	Metric	Result
Object Hallucination Evaluation	POPE	Accuracy	86.1	2056
Text-to-Image Generation	GenEval	Overall Score	77	914
Multimodal Understanding	MMBench	Accuracy	73.9	887
Multimodal Understanding	MM-Vet	MM-Vet Score	38	664
Multimodal Understanding	MMMU	Accuracy	34.3	437
Multimodal Understanding	MMMU	MMMU Score	34.3	110
Multimodal Perception	MME Perception	Perception Score	1.49e+3	99
Visual Question Answering	GQA	GQA Score	57.6	75
Multimodal Perception	MME	Perception Score	1.49e+3	45
Visual generation	GenEval	Two Obj. Acc	85	43

Showing 10 of 15 rows
