Show-o2: Improved Native Unified Multimodal Models
About
This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.
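To make the pairing of a language head and a flow head more concrete, the following is a minimal, dependency-free sketch of how a next-token prediction loss and a flow matching loss might be combined into one training objective. It is not the released Show-o2 code: the linear interpolation path, the function names (`cross_entropy`, `flow_matching_pair`, `combined_loss`), and the weighting factor `alpha` are all assumptions made for illustration.

```python
import math
import random


def cross_entropy(logits, target):
    """Next-token cross-entropy for one position (stable log-softmax)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def flow_matching_pair(x1, rng):
    """Build one flow matching training example from a clean latent x1.

    Assumption: a linear path x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I),
    whose velocity along the path is x1 - x0.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    t = rng.random()
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, t, v_target


def mse(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def combined_loss(text_logits, text_targets, v_pred, v_target, alpha=1.0):
    """Hypothetical objective: mean token cross-entropy + alpha * velocity MSE."""
    ntp = sum(cross_entropy(l, y) for l, y in zip(text_logits, text_targets))
    ntp /= len(text_targets)
    return ntp + alpha * mse(v_pred, v_target)


if __name__ == "__main__":
    rng = random.Random(0)
    latent = [0.5, -1.0, 2.0]            # stand-in for a visual latent
    x_t, t, v_target = flow_matching_pair(latent, rng)
    logits = [[2.0, 0.1, -1.0], [0.0, 0.0, 0.0]]  # stand-in language head output
    targets = [0, 2]
    v_pred = [0.0, 0.0, 0.0]             # stand-in flow head output
    print(combined_loss(logits, targets, v_pred, v_target))
```

In this sketch the language head is scored only on text positions and the flow head only on visual latents; how the two terms are actually balanced during each training stage is not stated in the abstract.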
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MMBench | Accuracy | 79.3 | 847 |
| Text-to-Image Generation | GenEval | Overall Score | 76 | 704 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 47.1 | 631 |
| Video Understanding | MVBench | Accuracy | 55.8 | 563 |
| Text-to-Image Generation | GenEval | Overall Score | 76 | 517 |
| Multimodal Understanding | SEED-Bench | -- | -- | 516 |
| Text-to-Image Generation | DPG-Bench | Overall Score | 86.14 | 451 |
| Text-to-Image Generation | GenEval | GenEval Score | 76 | 442 |
| Multimodal Understanding | MMMU | Accuracy | 48.9 | 437 |
| Optical Character Recognition | OCRBench | Score | 32.4 | 433 |