MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation
About
Unified multimodal models aim to integrate understanding and generation within a single framework, yet bridging the gap between discrete semantic reasoning and high-fidelity visual synthesis remains challenging. We present MammothModa2 (Mammoth2), a unified autoregressive-diffusion (AR-Diffusion) framework designed to effectively couple autoregressive semantic planning with diffusion-based generation. Mammoth2 adopts a serial design: an AR path equipped with generation experts performs global semantic modeling over discrete tokens, while a single-stream Diffusion Transformer (DiT) decoder handles high-fidelity image synthesis. A carefully designed AR-Diffusion feature alignment module combines multi-layer feature aggregation, unified condition encoding, and in-context conditioning to stably align the AR path's representations with the diffusion decoder's continuous latents. Mammoth2 is trained end-to-end with joint Next-Token Prediction and Flow Matching objectives, followed by supervised fine-tuning and reinforcement learning over both generation and editing tasks. With roughly 60M supervised generation samples and no reliance on pre-trained generators, Mammoth2 delivers strong text-to-image and instruction-based editing performance on public benchmarks, achieving 0.87 on GenEval, 87.2 on DPGBench, and 4.06 on ImgEdit, while remaining competitive with understanding-only backbones (e.g., Qwen3-VL-8B) on multimodal understanding tasks. These results suggest that a carefully coupled AR-Diffusion architecture can provide high-fidelity generation and editing while maintaining strong multimodal comprehension within a single, parameter- and data-efficient model.
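To make the training recipe concrete, below is a minimal PyTorch-style sketch of the joint objective described above: Next-Token Prediction cross-entropy over the AR path's discrete tokens plus a Flow Matching regression on the DiT decoder's continuous latents, with multi-layer AR hidden states aggregated into an in-context condition. The paper's code is not reproduced here, so every name, shape, and interface (`joint_ar_diffusion_loss`, the `dit_decoder` call signature, the mean-pooled layer aggregation, the `lambda_fm` weight) is an illustrative assumption rather than the authors' implementation.

```python
# Hypothetical sketch of Mammoth2's joint NTP + Flow Matching training loss.
# All module interfaces and hyperparameters are assumptions for illustration.
import torch
import torch.nn.functional as F

def joint_ar_diffusion_loss(
    ar_logits,         # (B, T, V) next-token logits from the AR path
    target_tokens,     # (B, T)   ground-truth discrete tokens, shifted by one
    ar_hidden_states,  # list of (B, T, D) hidden states from several AR layers
    image_latents,     # (B, N, C) clean continuous latents of the target image
    dit_decoder,       # single-stream DiT; call signature assumed below
    lambda_fm=1.0,     # assumed weight balancing the two objectives
):
    # 1) Next-Token Prediction: standard cross-entropy over discrete tokens.
    ntp_loss = F.cross_entropy(
        ar_logits.reshape(-1, ar_logits.size(-1)),
        target_tokens.reshape(-1),
    )

    # 2) Multi-layer feature aggregation: pool AR representations from several
    #    depths into one conditioning sequence (a simple mean here; the paper's
    #    exact aggregation scheme may differ).
    cond = torch.stack(ar_hidden_states, dim=0).mean(dim=0)

    # 3) Flow Matching: sample a time t, interpolate noise -> data along a
    #    straight path, and regress the constant velocity of that path.
    noise = torch.randn_like(image_latents)
    t = torch.rand(image_latents.size(0), 1, 1, device=image_latents.device)
    x_t = (1.0 - t) * noise + t * image_latents  # linear interpolation path
    velocity_target = image_latents - noise      # d x_t / d t for this path

    # In-context conditioning: the aggregated AR features are passed to the
    # DiT alongside the noisy latents (interface assumed for illustration).
    velocity_pred = dit_decoder(x_t, t.squeeze(-1).squeeze(-1), context=cond)
    fm_loss = F.mse_loss(velocity_pred, velocity_target)

    return ntp_loss + lambda_fm * fm_loss
```

In this serial design the diffusion decoder is conditioned entirely on the AR path's aggregated hidden states rather than on raw text, which is why the feature alignment between AR representations and the DiT's continuous latent space is the load-bearing component.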
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | GenEval | Overall Score | 0.87 | 467 |
| Text-to-Image Generation | GenEval | GenEval Score | 0.87 | 277 |
| Mathematical Reasoning | MathVista (testmini) | Accuracy | 81.8 | 51 |
| Instruction-Based Image Editing | EMU Edit (test) | CLIP Image Similarity | 0.891 | 46 |
| Visual Reasoning | MM-Vet | Score | 79.4 | 34 |
| Text-to-Image Generation | DPGBench | DPGBench Score | 87.2 | 31 |
| Multi-discipline Reasoning | MMMU standard (test) | MMMU Score | 71.23 | 14 |
| Multimodal Understanding | MMBench v1.1 (dev) | MMBench Score | 86.6 | 14 |
| Visual Text Reasoning and Recognition | OCRBench v2 | Recognition Accuracy | 68.2 | 14 |
| Image Editing | ImgEdit | ImgEdit Score | 4.06 | 12 |