MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation

About

Unified multimodal models aim to integrate understanding and generation within a single framework, yet bridging the gap between discrete semantic reasoning and high-fidelity visual synthesis remains challenging. We present MammothModa2 (Mammoth2), a unified autoregressive-diffusion (AR-Diffusion) framework designed to effectively couple autoregressive semantic planning with diffusion-based generation. Mammoth2 adopts a serial design: an AR path equipped with generation experts performs global semantic modeling over discrete tokens, while a single-stream Diffusion Transformer (DiT) decoder handles high-fidelity image synthesis. A carefully designed AR-Diffusion feature alignment module combines multi-layer feature aggregation, unified condition encoding, and in-context conditioning to stably align AR's representations with the diffusion decoder's continuous latents. Mammoth2 is trained end-to-end with joint Next-Token Prediction and Flow Matching objectives, followed by supervised fine-tuning and reinforcement learning over both generation and editing. With roughly 60M supervised generation samples and no reliance on pre-trained generators, Mammoth2 delivers strong text-to-image and instruction-based editing performance on public benchmarks, achieving 0.87 on GenEval, 87.2 on DPGBench, and 4.06 on ImgEdit, while remaining competitive with understanding-only backbones (e.g., Qwen3-VL-8B) on multimodal understanding tasks. These results suggest that a carefully coupled AR-Diffusion architecture can provide high-fidelity generation and editing while maintaining strong multimodal comprehension within a single, parameter- and data-efficient model.
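The joint Next-Token Prediction and Flow Matching training described above can be illustrated with a minimal sketch. The abstract does not specify the flow-matching variant or how the two losses are weighted, so the straight-line (rectified-flow style) interpolation, the `fm_weight` parameter, and all function names below are assumptions made for illustration, not the authors' implementation.

```python
import math

def next_token_loss(logits, targets):
    """Mean cross-entropy over positions.

    logits[i] holds unnormalized scores for the token that follows position i
    (assumed to be already shifted); targets[i] is that token's index.
    """
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]
    return total / len(targets)

def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between a predicted velocity and the target x1 - x0.

    Assumes a straight path x_t = (1 - t) * x0 + t * x1 from noise x0 to the
    data latent x1. predict_velocity is a hypothetical stand-in for the
    diffusion decoder: it takes (x_t, t) and returns a list like x_t.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)

def joint_loss(ntp, fm, fm_weight=1.0):
    """Weighted sum of the two objectives; fm_weight is an assumed knob."""
    return ntp + fm_weight * fm
```

In this sketch the autoregressive path would be scored by `next_token_loss` and the DiT decoder by `flow_matching_loss`, with `joint_loss` combining the two into the single scalar that end-to-end training would minimize.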

Tao Shen, Xin Wan, Taicai Chen, Rui Zhang, Junwen Pan, Dawei Lu, Fanding Lei, Zhilin Lu, Yunfei Yang, Chen Cheng, Qi She, Chang Liu, Zhenbang Sun • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Image Generation | GenEval | Overall Score: 87 | 467 |
| Text-to-Image Generation | GenEval | GenEval Score: 87 | 277 |
| Mathematical Reasoning | MathVista (testmini) | Accuracy: 81.8 | 51 |
| Instructive image editing | EMU Edit (test) | CLIP Image Similarity: 0.891 | 46 |
| Visual Reasoning | MM-Vet | Score: 79.4 | 34 |
| Text-to-Image Generation | DPGBench | DPGBench Score: 87.2 | 31 |
| Multi-discipline Reasoning | MMMU standard (test) | MMMU Score: 71.23 | 14 |
| Multimodal Understanding | MMBench v1.1 (dev) | MMBench Score: 86.6 | 14 |
| Visual Text Reasoning and Recognition | OCRBench v2 | Recognition Accuracy: 68.2 | 14 |
| Image Editing | ImgEdit | ImgEdit: 4.06 | 12 |
Showing 10 of 15 rows
