
OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation

About

We introduce OneCAT, a unified multimodal model that seamlessly integrates understanding, generation, and editing within a novel, pure decoder-only transformer architecture. Our framework uniquely eliminates the need for external components such as Vision Transformers (ViTs) or vision tokenizers during inference, leading to significant efficiency gains, especially for high-resolution inputs. This is achieved through a modality-specific Mixture-of-Experts (MoE) structure trained with a single autoregressive (AR) objective, which also natively supports dynamic resolutions. Furthermore, we pioneer a multi-scale visual autoregressive mechanism within the Large Language Model (LLM) that drastically reduces decoding steps compared to diffusion-based methods while maintaining state-of-the-art performance. Our findings demonstrate the powerful potential of pure autoregressive modeling as a sufficient and elegant foundation for unified multimodal intelligence. As a result, OneCAT sets a new performance standard, outperforming existing open-source unified multimodal models across benchmarks for multimodal generation, editing, and understanding.
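To make the idea of a modality-specific MoE more concrete, here is a minimal sketch of hard routing, where each token is sent to one feed-forward expert chosen by its modality tag. This is an illustration under stated assumptions, not OneCAT's implementation: the abstract does not specify the routing rule, the number of experts, or their names, so the `"text"` and `"image"` tags, `make_scaling_expert`, and `route_by_modality` are all hypothetical, and the experts are toy scaling functions standing in for real feed-forward layers.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Expert = Callable[[Vector], Vector]


def make_scaling_expert(scale: float) -> Expert:
    """Toy stand-in for a feed-forward expert: scales every feature."""
    def expert(x: Vector) -> Vector:
        return [scale * v for v in x]
    return expert


def route_by_modality(
    tokens: Sequence[Vector],
    modalities: Sequence[str],
    experts: Dict[str, Expert],
) -> List[Vector]:
    """Send each token to the expert registered for its modality tag.

    Assumption: routing is a fixed lookup on the modality tag rather than
    a learned gate. The source text does not confirm this detail.
    """
    if len(tokens) != len(modalities):
        raise ValueError("tokens and modalities must have the same length")
    outputs: List[Vector] = []
    for token, modality in zip(tokens, modalities):
        if modality not in experts:
            raise KeyError(f"no expert registered for modality {modality!r}")
        outputs.append(experts[modality](token))
    return outputs


# Hypothetical two-expert setup; a real model would use learned FFN weights.
EXPERTS: Dict[str, Expert] = {
    "text": make_scaling_expert(1.0),
    "image": make_scaling_expert(2.0),
}

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
modalities = ["text", "image", "text"]
routed = route_by_modality(tokens, modalities, EXPERTS)
print(routed)  # [[1.0, 2.0], [6.0, 8.0], [5.0, 6.0]]
```

Because the lookup is deterministic, only one expert runs per token, so compute per token stays the same as in a dense model while each modality gets its own parameters.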

Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, Hongkai Xiong • 2025

Related benchmarks

Task                                      | Dataset     | Result            | Rank
------------------------------------------|-------------|-------------------|-----
Text-based Visual Question Answering      | TextVQA     | Accuracy 67       | 496
Multimodal Understanding                  | MM-Vet      | MM-Vet Score 52.2 | 418
Multimodal Understanding                  | MMBench     | --                | 367
Multi-discipline Multimodal Understanding | MMMU        | --                | 266
Chart Question Answering                  | ChartQA     | Accuracy 76.2     | 229
Visual Question Answering                 | AI2D        | Accuracy 72.4     | 174
Document Visual Question Answering        | DocVQA      | ANLS 87.1         | 164
Multimodal Understanding                  | MMMU (test) | MMMU Score 41.9   | 86
Infographic Visual Question Answering     | InfoVQA     | Accuracy 56.3     | 40
Multi-modal Vision-Language Understanding | MMVet       | Score 42.4        | 38
Showing 10 of 15 rows
