LaViDa-R1: Advancing Reasoning for Unified Multimodal Diffusion Language Models
About
Diffusion language models (dLLMs) have recently emerged as a promising alternative to auto-regressive LLMs, and the latest works have further extended them to multimodal understanding and generation tasks. In this work, we propose LaViDa-R1, a general-purpose multimodal reasoning dLLM. Unlike existing works that build reasoning dLLMs through task-specific reinforcement learning, LaViDa-R1 incorporates diverse multimodal understanding and generation tasks in a unified manner. In particular, LaViDa-R1 is built with a novel unified post-training framework that seamlessly integrates supervised fine-tuning (SFT) and multi-task reinforcement learning (RL). It employs several novel training techniques, including answer-forcing, tree search, and complementary likelihood estimation, to enhance effectiveness and scalability. Extensive experiments demonstrate LaViDa-R1's strong performance on a wide range of multimodal tasks, including visual math reasoning, reason-intensive grounding, and image editing.
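To give intuition for the decoding setting, the toy sketch below simulates iterative masked-diffusion decoding with an "answer-forcing" constraint. It is a minimal illustration under our own assumptions, not LaViDa-R1's actual implementation: we assume answer-forcing means clamping a known answer span at every denoising step so the remaining (reasoning) tokens are filled in consistently with it; `MASK`, `model_fill`, and the unmasking schedule are all hypothetical stand-ins for the real model.

```python
import random

MASK = -1  # hypothetical placeholder id for a masked token


def denoise_step(tokens, model_fill, n_unmask):
    """Unmask up to n_unmask masked positions using a (toy) model prediction."""
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    for i in random.sample(masked, min(n_unmask, len(masked))):
        tokens[i] = model_fill(i)
    return tokens


def decode_with_answer_forcing(length, answer, model_fill, steps=4):
    """Iterative masked-diffusion decoding where the final `answer` span is
    clamped (forced) at every step, so the rest of the sequence is denoised
    conditioned on the target answer."""
    tokens = [MASK] * length
    span = range(length - len(answer), length)
    for _ in range(steps):
        for j, i in enumerate(span):  # clamp the answer tokens each step
            tokens[i] = answer[j]
        # toy schedule: unmask about half of the remaining masked tokens
        n = max(1, sum(t == MASK for t in tokens) // 2)
        tokens = denoise_step(tokens, model_fill, n)
    for j, i in enumerate(span):  # final clamp
        tokens[i] = answer[j]
    return tokens
```

With a dummy filler such as `lambda i: 0`, the output always ends with the forced answer span regardless of which positions the schedule unmasks first; only the filling of the non-answer positions varies.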
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy | 81.5 | 351 |
| Visual Question Answering | ChartQA | Accuracy | 81.7 | 239 |
| Visual Mathematical Reasoning | MathVista | Accuracy | 60.0 | 189 |
| Visual Question Answering | AI2D | Accuracy | 78.9 | 174 |
| Mathematical Reasoning | MATH 500 | Accuracy | 38.6 | 73 |
| Visual Mathematical Reasoning | MathVerse | Accuracy | 38.7 | 73 |
| Image Editing | ImgEdit 1.0 (test) | Add Score | 4.25 | 17 |
| Reason-intensive Grounding | LISA Grounding | P@0.5 | 66.7 | 8 |
| Visual Question Answering | MMMU-Pro | Accuracy | 32.8 | 6 |