UniGame: Turning a Unified Multimodal Model Into Its Own Adversary

About

Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets this inconsistency. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek out and challenge fragile understanding, turning the model into its own adversary. Experiments demonstrate that UniGame significantly improves consistency (+4.6%). It also achieves substantial improvements in understanding (+3.6%), generation (+0.02 on GenEval), and out-of-distribution and adversarial robustness (+4.8% on NaturalBench and +6.2% on AdVQA). The framework is architecture-agnostic, introduces less than 1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models. The official code is available at: https://github.com/AIFrontierLab/TorchUMM
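
The self-adversarial idea in the abstract can be illustrated with a toy minimax loop. The sketch below is not the authors' implementation and does not use the TorchUMM code: it assumes a logistic-regression "understanding head" over 2-D embeddings and a hypothetical perturber that learns one bounded offset per label. All names (`self_adversarial_train`, `TOY_DATA`, `eps`) are illustrative. The perturber ascends the understanding loss inside an L2 ball, and the head then descends the loss on both clean and perturbed embeddings.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def project(delta, eps):
    """Project a perturbation onto the L2 ball of radius eps."""
    norm = math.sqrt(sum(d * d for d in delta))
    return list(delta) if norm <= eps else [d * eps / norm for d in delta]

def grads(w, b, z, y):
    """Gradients of the logistic loss w.r.t. weights, bias, and input z."""
    err = sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b) - y
    return [err * zi for zi in z], err, [err * wi for wi in w]

def shift(z, delta):
    return [zi + di for zi, di in zip(z, delta)]

def self_adversarial_train(data, eps=0.5, steps=200, lr=0.1, lr_adv=0.1):
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    # Toy perturber (assumption): one learned offset per label.
    deltas = {0: [0.0] * dim, 1: [0.0] * dim}
    for _ in range(steps):
        # Perturber step: gradient ASCENT on the understanding loss.
        for z, y in data:
            _, _, gz = grads(w, b, shift(z, deltas[y]), y)
            stepped = [d + lr_adv * g for d, g in zip(deltas[y], gz)]
            deltas[y] = project(stepped, eps)
        # Understanding step: gradient DESCENT on clean + perturbed inputs.
        gw, gb = [0.0] * dim, 0.0
        for z, y in data:
            for inp in (z, shift(z, deltas[y])):
                dw, db, _ = grads(w, b, inp, y)
                gw = [a + c for a, c in zip(gw, dw)]
                gb += db
        n = 2 * len(data)
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b, deltas

def accuracy(w, b, data, deltas=None):
    """Accuracy on clean inputs, or on perturbed inputs if deltas is given."""
    hits = 0
    for z, y in data:
        inp = shift(z, deltas[y]) if deltas else z
        pred = 1 if sum(wi * zi for wi, zi in zip(w, inp)) + b > 0 else 0
        hits += pred == y
    return hits / len(data)

# Hypothetical, well-separated toy embeddings with binary labels.
TOY_DATA = [([2.0, 2.0], 1), ([2.0, 1.5], 1),
            ([-2.0, -2.0], 0), ([-2.0, -1.5], 0)]

if __name__ == "__main__":
    w, b, deltas = self_adversarial_train(TOY_DATA)
    print("clean accuracy:", accuracy(w, b, TOY_DATA))
    print("perturbed accuracy:", accuracy(w, b, TOY_DATA, deltas))
```

In this toy setting the perturber has only a few parameters relative to nothing meaningful, so it says nothing about the paper's parameter budget; it only shows the alternating ascend-then-descend structure of a self-adversarial update.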

Zhaolong Su, Wang Lu, Hao Chen, Sharon Li, Jindong Wang • 2025

Related benchmarks

Task	Dataset	Metric	Result
Object Hallucination Evaluation	POPE	--	--	2019
Multimodal Understanding	MMBench	--	--	847
Text-to-Image Generation	GenEval	Overall Score	82	704
Multimodal Understanding	MM-Vet	MM-Vet Score	60.7	631
Multimodal Understanding	MMMU	MMMU Score	52.4	102
Visual Question Answering	VQAv2 (test)	VQA Accuracy	83.4	82
Text-to-Image Generation	WISE	WISE Score	0.43	67
Multimodal Understanding	MME	MME Score	1.69e+3	16
Multimodal Understanding	MathVista	Accuracy (Multi-Choice)	79.6	16
Multimodal Consistency	UnifiedBench & WISE Composite	Average Score	42.82	10

Showing 10 of 12 rows
