LatentUMM: Dual Latent Alignment for Unified Multimodal Models

About

Unified multimodal models (UMMs) achieve strong performance in both understanding and generation by learning a shared latent space, yet they often exhibit functional inconsistency between these two capabilities. We observe that this issue does not stem from a lack of shared representations, but from the absence of explicit alignment between the transformations that map into and out of the latent space. As a result, generation and re-encoding can follow inconsistent trajectories, leading to semantic drift under modality transitions. In this work, we propose LatentUMM, a framework that constructs an enhanced shared latent space to explicitly align these transformations and improve cross-modal consistency. LatentUMM consists of two stages. First, dual latent alignment enforces consistency at both the modality and capacity levels: cross-modal alignment uses a stronger embedding model to impose structured cross-modal semantics, while dual capacity alignment enforces bidirectional consistency under generation and re-encoding. Second, latent dynamics stabilization improves robustness via stochastic latent rollouts and preference optimization, favoring trajectories that better preserve semantic consistency. Experiments show that LatentUMM consistently improves multimodal consistency across diverse architectures. Code is available at: https://github.com/AIFrontierLab/TorchUMM/tree/main/src/umm/post_training/LatentUMM.
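The second stage described above (stochastic latent rollouts scored for semantic consistency, then used for preference optimization) can be pictured with a toy sketch. This is an illustration only and not the authors' implementation, which is in the linked repository. The names `cosine`, `rollout`, and `build_preference_pair` are hypothetical, the generate-then-re-encode round trip is replaced by simple Gaussian noise on a latent vector, and cosine similarity is an assumed stand-in for whatever consistency score the paper uses.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rollout(latent, noise_scale, rng):
    """Hypothetical stand-in for generate -> re-encode.

    A real system would decode the latent to another modality and encode it
    back; here the round trip is imitated by adding Gaussian noise.
    """
    return [x + rng.gauss(0.0, noise_scale) for x in latent]


def build_preference_pair(latent, num_rollouts=4, noise_scale=0.5, seed=0):
    """Sample several rollouts and return (chosen, rejected).

    The rollout most similar to the original latent is treated as preferred
    and the least similar one as rejected (assumed scoring rule).
    """
    rng = random.Random(seed)
    candidates = [rollout(latent, noise_scale, rng) for _ in range(num_rollouts)]
    ranked = sorted(candidates, key=lambda z: cosine(latent, z))
    return ranked[-1], ranked[0]


if __name__ == "__main__":
    original = [1.0, 0.0, 2.0]
    chosen, rejected = build_preference_pair(original)
    print("chosen similarity:  ", cosine(original, chosen))
    print("rejected similarity:", cosine(original, rejected))
```

In a setup like this, the (chosen, rejected) pairs would feed a preference-optimization objective so that the model favors rollouts that drift less from the original semantics. The specific objective is not stated in the abstract and is left out here.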

Yinyi Luo, Wenwen Wang, Hayes Bai, Marios Savvides, Jindong Wang • 2026

Related benchmarks

Task	Dataset	Metric	Result
Multimodal Understanding	MMBench	--	--	887
Multimodal Understanding	MM-Vet	MM-Vet Score	67.2	664
Multimodal Understanding	MMMU	MMMU Score	53.2	110
Multimodal Understanding and Generation	WISE	Overall Accuracy	41.8	65
Multimodal Understanding	MME	MME Score	1.70e+3	16
Multimodal Understanding	MathVista	Accuracy (Multi-Choice)	80.37	16
Consistency Evaluation	Unified-Bench	CLIP Score	89.95	4
Consistency Evaluation	RealUnify GEU	MC Score	0.31	4
Multimodal Generation	DPG-Bench	Global Score	82.37	3
Multimodal Generation	UEval	Text Score	55.38	3

Showing 10 of 12 rows
