Show-o2: Improved Native Unified Multimodal Models

About

This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual path of spatial(-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.
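
To make the two training signals named in the abstract concrete, here is a minimal NumPy sketch of a next-token cross-entropy loss (for the language head) combined with a flow matching regression loss (for the flow head). It is not the released Show-o2 code. The straight-line noise-to-data path, the weighting factor `alpha`, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def next_token_loss(logits, targets):
    """Mean cross-entropy over text positions. logits: (T, V), targets: (T,)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def flow_matching_pair(x1, rng):
    """Sample a training pair on a straight path from noise x0 to data x1.

    Assumption: linear interpolation x_t = (1 - t) * x0 + t * x1, whose
    target velocity is the constant x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)  # Gaussian noise sample
    t = rng.uniform(0.0, 1.0)           # random time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, t, x1 - x0

def flow_matching_loss(pred_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((pred_velocity - target_velocity) ** 2))

def combined_loss(logits, targets, pred_velocity, target_velocity, alpha=1.0):
    """Weighted sum of the two objectives; alpha is a hypothetical weight."""
    ntp = next_token_loss(logits, targets)
    fm = flow_matching_loss(pred_velocity, target_velocity)
    return alpha * ntp + fm
```

In a real model, `logits` would come from the language head and `pred_velocity` from the flow head evaluated at `(x_t, t)`; here they are plain arrays so that the arithmetic can be checked in isolation.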

Jinheng Xie, Zhenheng Yang, Mike Zheng Shou • 2025

Related benchmarks

Task	Dataset	Result
Multimodal Understanding	MMBench	Accuracy 79.3	847
Text-to-Image Generation	GenEval	Overall Score 76	704
Multimodal Understanding	MM-Vet	MM-Vet Score 47.1	631
Video Understanding	MVBench	Accuracy 55.8	563
Text-to-Image Generation	GenEval	Overall Score 76	517
Multimodal Understanding	SEED-Bench	--	516
Text-to-Image Generation	DPG-Bench	Overall Score 86.14	451
Text-to-Image Generation	GenEval	GenEval Score 76	442
Multimodal Understanding	MMMU	Accuracy 48.9	437
Optical Character Recognition	OCRBench	Score 32.4	433

Showing 10 of 135 rows
