Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
About
In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.
Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan • 2025
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 87.4 | 2019 |
| Visual Question Answering | GQA | Accuracy | 62 | 1425 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 45.6 | 962 |
| Multimodal Understanding | MMBench | Accuracy | 79.2 | 847 |
| Text-to-Image Generation | GenEval | Overall Score | 80 | 704 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 50 | 631 |
| Visual Question Answering | GQA | Accuracy | 61.3 | 524 |
| Visual Question Answering | ChartQA | -- | -- | 519 |
| Text-to-Image Generation | GenEval | Overall Score | 80 | 517 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 50 | 517 |
Showing 10 of 301 rows