MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation
About
Unified multimodal models aim to integrate understanding and generation within a single framework, yet bridging the gap between discrete semantic reasoning and high-fidelity visual synthesis remains challenging. We present MammothModa2 (Mammoth2), a unified autoregressive-diffusion (AR-Diffusion) framework designed to effectively couple autoregressive semantic planning with diffusion-based generation. Mammoth2 adopts a serial design: an AR path equipped with generation experts performs global semantic modeling over discrete tokens, while a single-stream Diffusion Transformer (DiT) decoder handles high-fidelity image synthesis. A carefully designed AR-Diffusion feature alignment module combines multi-layer feature aggregation, unified condition encoding, and in-context conditioning to stably align the AR path's representations with the diffusion decoder's continuous latents. Mammoth2 is trained end-to-end with joint Next-Token Prediction and Flow Matching objectives, followed by supervised fine-tuning and reinforcement learning over both generation and editing tasks. With roughly 60M supervised generation samples and no reliance on pre-trained generators, Mammoth2 delivers strong text-to-image and instruction-based editing performance on public benchmarks, achieving 0.87 on GenEval, 87.2 on DPGBench, and 4.06 on ImgEdit, while remaining competitive with understanding-only backbones (e.g., Qwen3-VL-8B) on multimodal understanding tasks. These results suggest that a carefully coupled AR-Diffusion architecture can provide high-fidelity generation and editing while maintaining strong multimodal comprehension within a single, parameter- and data-efficient model.
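To make the training recipe concrete, below is a minimal PyTorch-style sketch of the joint objective described above: Next-Token Prediction cross-entropy over the AR path's discrete tokens plus a Flow Matching regression on the DiT decoder's continuous latents, with multi-layer AR hidden states aggregated into an in-context condition. The paper's code is not reproduced here, so every name, shape, and interface (`joint_ar_diffusion_loss`, the `dit_decoder` call signature, the mean-pooled layer aggregation, the `lambda_fm` weight) is an illustrative assumption rather than the authors' implementation.

```python
# Hypothetical sketch of Mammoth2's joint NTP + Flow Matching training loss.
# All module interfaces and hyperparameters are assumptions for illustration.
import torch
import torch.nn.functional as F

def joint_ar_diffusion_loss(
    ar_logits,         # (B, T, V) next-token logits from the AR path
    target_tokens,     # (B, T)   ground-truth discrete tokens, shifted by one
    ar_hidden_states,  # list of (B, T, D) hidden states from several AR layers
    image_latents,     # (B, N, C) clean continuous latents of the target image
    dit_decoder,       # single-stream DiT; call signature assumed below
    lambda_fm=1.0,     # assumed weight balancing the two objectives
):
    # 1) Next-Token Prediction: standard cross-entropy over discrete tokens.
    ntp_loss = F.cross_entropy(
        ar_logits.reshape(-1, ar_logits.size(-1)),
        target_tokens.reshape(-1),
    )

    # 2) Multi-layer feature aggregation: pool AR representations from several
    #    depths into one conditioning sequence (a simple mean here; the paper's
    #    exact aggregation scheme may differ).
    cond = torch.stack(ar_hidden_states, dim=0).mean(dim=0)

    # 3) Flow Matching: sample a time t, interpolate noise -> data along a
    #    straight path, and regress the constant velocity of that path.
    noise = torch.randn_like(image_latents)
    t = torch.rand(image_latents.size(0), 1, 1, device=image_latents.device)
    x_t = (1.0 - t) * noise + t * image_latents  # linear interpolation path
    velocity_target = image_latents - noise      # d x_t / d t for this path

    # In-context conditioning: the aggregated AR features are passed to the
    # DiT alongside the noisy latents (interface assumed for illustration).
    velocity_pred = dit_decoder(x_t, t.squeeze(-1).squeeze(-1), context=cond)
    fm_loss = F.mse_loss(velocity_pred, velocity_target)

    return ntp_loss + lambda_fm * fm_loss
```

In this serial design the diffusion decoder is conditioned entirely on the AR path's aggregated hidden states rather than on raw text, which is why the feature alignment between AR representations and the DiT's continuous latent space is the load-bearing component.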
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | GenEval | Overall Score | 0.87 | 467 |
| Text-to-Image Generation | GenEval | GenEval Score | 0.87 | 277 |
| Mathematical Reasoning | MathVista (testmini) | Accuracy | 81.8 | 51 |
| Instruction-Based Image Editing | EMU Edit (test) | CLIP Image Similarity | 0.891 | 46 |
| Visual Reasoning | MM-Vet | Score | 79.4 | 34 |
| Text-to-Image Generation | DPGBench | DPGBench Score | 87.2 | 31 |
| Multi-discipline Reasoning | MMMU standard (test) | MMMU Score | 71.23 | 14 |
| Multimodal Understanding | MMBench v1.1 (dev) | MMBench Score | 86.6 | 14 |
| Visual Text Reasoning and Recognition | OCRBench v2 | Recognition Accuracy | 68.2 | 14 |
| Image Editing | ImgEdit | ImgEdit Score | 4.06 | 12 |