NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation
About
We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a unified autoregressive architecture, NextFlow natively activates multimodal understanding and generation capabilities, unlocking image editing, interleaved content generation, and video generation. Motivated by the distinct nature of the two modalities (text is strictly sequential, while images are inherently hierarchical), we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods and enables the generation of 1024x1024 images in just 5 seconds, orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation through a robust training recipe. Furthermore, we introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.
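To illustrate why next-scale prediction can need far fewer sequential decoding steps than raster-scan decoding, here is a minimal counting sketch. It is not NextFlow's implementation. The 64x64 final token grid and the doubling scale schedule are assumptions chosen for illustration (the abstract does not state the tokenizer's downsampling factor or the scale schedule), and the function names are hypothetical.

```python
from typing import Sequence

# ASSUMPTION: a 1024x1024 image maps to a 64x64 grid of discrete tokens
# (a 16x downsampling tokenizer). The abstract does not specify this.
FINAL_GRID_SIDE = 64

# ASSUMPTION: a hypothetical coarse-to-fine schedule of square token maps.
SCALES = [1, 2, 4, 8, 16, 32, 64]


def raster_scan_steps(grid_side: int) -> int:
    """Sequential steps when tokens are emitted one at a time in raster order."""
    return grid_side * grid_side


def next_scale_steps(scales: Sequence[int]) -> int:
    """Sequential steps when each scale's whole token map is predicted in one pass."""
    return len(scales)


def next_scale_tokens(scales: Sequence[int]) -> int:
    """Total tokens produced across all scales (more than the final grid alone)."""
    return sum(side * side for side in scales)


if __name__ == "__main__":
    print("raster-scan sequential steps:", raster_scan_steps(FINAL_GRID_SIDE))
    print("next-scale sequential steps: ", next_scale_steps(SCALES))
    print("next-scale total tokens:     ", next_scale_tokens(SCALES))
```

Under these assumptions, raster-scan decoding takes 4096 sequential steps while next-scale decoding takes 7, even though the multi-scale scheme produces more tokens in total (5461). Step count is only a proxy for latency: each next-scale step processes many tokens in parallel, so this sketch does not reproduce the reported 5-second figure.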
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-based Visual Question Answering | TextVQA | Accuracy | 58.9 | 496 |
| Text-to-Image Generation | GenEval | Overall Score | 84 | 467 |
| OCR Evaluation | OCRBench | Score | 55.1 | 296 |
| Multimodal Understanding | MMMU | Accuracy | 37.1 | 275 |
| Multi-discipline Multimodal Understanding | MMMU | -- | -- | 266 |
| Visual Question Answering | ChartQA | Accuracy | 57.7 | 239 |
| Multimodal Understanding | MMStar | Accuracy | 53 | 197 |
| Text-to-Image Generation | DPG-Bench | Overall Score | 86 | 173 |
| Text-to-Image Generation | DPG | Overall Score | 88.32 | 131 |
| Visual Question Answering | TextVQA | Accuracy | 58.9 | 79 |