TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models

About

Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that builds a unified continuous visual representation by cascading a VAE encoder with a representation encoder. This unified representation space allows end-to-end processing of images and videos for both understanding and generation tasks. Compared to prior UMMs with decoupled representations, TUNA's unified visual space avoids representation format mismatches introduced by separate encoders, outperforming decoupled alternatives in both understanding and generation. Moreover, we observe that stronger pretrained representation encoders consistently yield better performance across all multimodal tasks, highlighting the importance of the representation encoder. Finally, in this unified setting, jointly training on both understanding and generation data allows the two tasks to benefit from each other rather than interfere. Our extensive experiments on multimodal understanding and generation benchmarks show that TUNA achieves state-of-the-art results in image and video understanding, image and video generation, and image editing, demonstrating the effectiveness and scalability of its unified representation design.
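The cascade described above (a VAE encoder feeding a representation encoder, with one output shared by understanding and generation) can be illustrated with a toy sketch. This is not the TUNA implementation: the random linear maps stand in for the real encoders, and every name and dimension here (`make_linear`, `unified_encode`, `PATCH_DIM`, `LATENT_DIM`, `TOKEN_DIM`) is a hypothetical chosen for illustration only.

```python
import random

def make_linear(in_dim, out_dim, seed):
    """Return a fixed random linear map; a stand-in for a trained encoder."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]

    def apply(x):
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return apply

# Hypothetical sizes, not taken from the paper.
PATCH_DIM, LATENT_DIM, TOKEN_DIM = 12, 4, 8

vae_encoder = make_linear(PATCH_DIM, LATENT_DIM, seed=0)             # pixels -> latents
representation_encoder = make_linear(LATENT_DIM, TOKEN_DIM, seed=1)  # latents -> tokens

def unified_encode(patches):
    """Cascade the two encoders: pixel patches -> VAE latents -> unified tokens."""
    latents = [vae_encoder(p) for p in patches]
    return [representation_encoder(z) for z in latents]

# Three dummy image patches; both task heads would consume the same tokens.
patches = [[float(i + j) for j in range(PATCH_DIM)] for i in range(3)]
tokens = unified_encode(patches)
understanding_input = tokens  # e.g. fed to a language-model head
generation_input = tokens     # e.g. fed to a generative head
```

The point of the sketch is only that a single `unified_encode` output serves both tasks, in contrast to decoupled designs that run a separate encoder per task.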

Zhiheng Liu, Weiming Ren, Haozhe Liu, Zijian Zhou, Shoufa Chen, Haonan Qiu, Xiaoke Huang, Zhaochong An, Fanny Yang, Aditya Patel, Viktar Atliha, Tony Ng, Xiao Han, Chuyan Zhu, Chenyang Zhang, Ding Liu, Juan-Manuel Perez-Rua, Sen He, Jürgen Schmidhuber, Wenhu Chen, Ping Luo, Wei Liu, Tao Xiang, Jonas Schult, Yuren Cong • 2025

Related benchmarks

Task	Dataset	Result
Text-to-Image Generation	GenEval	Overall Score: 90	914
Video Understanding	MVBench	--	635
Visual Question Answering	ChartQA	--	620
Text-to-Image Generation	GenEval	Overall Score: 90	581
Multimodal Understanding	SEED-Bench	--	571
Text-to-Image Generation	DPG-Bench	Overall Score: 86.8	510
Optical Character Recognition	OCRBench	Score: 74.3	486
Multi-discipline Multimodal Understanding	MMMU	--	422
Chart Question Answering	ChartQA	Accuracy: 82.1	404
OCR Evaluation	OCRBench	Score: 74.3	350

Showing 10 of 37 rows
