FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities
About
The rapid progress of large language models (LLMs) has catalyzed the emergence of multimodal large language models (MLLMs) that unify visual understanding and image generation within a single framework. However, most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development, such as the raster-scan order in image generation and restricted reasoning abilities in causal context modeling. In this work, we challenge the dominance of AR-based approaches by introducing FUDOKI, a unified multimodal model purely based on discrete flow matching, as an alternative to conventional AR paradigms. By leveraging metric-induced probability paths with kinetic optimal velocities, our framework goes beyond the previous masking-based corruption process, enabling iterative refinement with self-correction capability and richer bidirectional context integration during generation. To mitigate the high cost of training from scratch, we initialize FUDOKI from pre-trained AR-based MLLMs and adaptively transition to the discrete flow matching paradigm. Experimental results show that FUDOKI achieves performance comparable to state-of-the-art AR-based MLLMs across both visual understanding and image generation tasks, highlighting its potential as a foundation for next-generation unified multimodal models. Furthermore, we show that applying test-time scaling techniques to FUDOKI yields significant performance gains, further underscoring its promise for future enhancement through reinforcement learning.
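To make the phrase "metric-induced probability paths" more concrete, here is a minimal sketch. The abstract does not spell out the formulation, so everything here is an assumption made for illustration: the softmax-over-distance form, the squared Euclidean distance, the scalar `beta`, the helper name `metric_path`, and the toy embeddings. It is not FUDOKI's implementation. The idea shown is that a token is corrupted toward tokens that are close to it under some metric, rather than being replaced by a single mask token.

```python
import numpy as np

def metric_path(embeddings, target, beta):
    """Toy distribution p(x | x1) proportional to exp(-beta * d(x, x1)).

    Hypothetical helper: `embeddings` is a (vocab, dim) array, `target` is the
    index of the clean token x1, and `beta` >= 0 controls how concentrated the
    distribution is (assumed to grow as the path moves from noise to data).
    """
    # Squared Euclidean distance from every token embedding to the target's.
    d = np.sum((embeddings - embeddings[target]) ** 2, axis=1)
    logits = -beta * d
    logits -= logits.max()  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Made-up vocabulary of 4 tokens with 1-D embeddings.
emb = np.array([[0.0], [1.0], [2.0], [5.0]])

p_noise = metric_path(emb, target=1, beta=0.0)   # beta = 0: uniform over the vocabulary
p_mid = metric_path(emb, target=1, beta=1.0)     # mass spread over nearby tokens
p_data = metric_path(emb, target=1, beta=50.0)   # large beta: nearly one-hot on the target
```

In this toy setup, intermediate values of `beta` leave probability on tokens near the target, which is one way to picture how a partially corrupted sequence can still carry information that later refinement steps revise.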
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 86.1 | 1455 |
| Multimodal Understanding | MMBench | Accuracy | 73.9 | 637 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 38 | 531 |
| Multimodal Understanding | MMMU | Accuracy | 34.3 | 437 |
| Multimodal Perception | MME Perception | Perception Score | 1.49e+3 | 79 |
| Multimodal Understanding | MMMU | MMMU Score | 34.3 | 59 |
| Multimodal Perception | MME | Perception Score | 1.49e+3 | 43 |
| Visual Question Answering | GQA | GQA Score | 57.6 | 37 |
| Visual Generation | GenEval | Single Obj. Acc | 96 | 31 |
| Multidisciplinary Knowledge | MMMU | Score | 34.7 | 21 |