Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

About

In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan • 2025

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 87.4	2056
Visual Question Answering	GQA	Accuracy 62	1445
Text-based Visual Question Answering	TextVQA	Accuracy 45.6	984
Text-to-Image Generation	GenEval	Overall Score 80	914
Multimodal Understanding	MMBench	Accuracy 79.2	887
Multimodal Understanding	MM-Vet	MM-Vet Score 50	664
Visual Question Answering	ChartQA	--	620
Text-to-Image Generation	GenEval	Overall Score 80	581
Multimodal Understanding	SEED-Bench	Accuracy 72.1	571
Mathematical Reasoning	MathVista	Score 42.5	566

Showing 10 of 366 rows
