LVRPO: Language-Visual Alignment with GRPO for Multimodal Understanding and Generation

About

Unified multimodal pretraining has emerged as a promising paradigm for jointly modeling language and vision within a single foundation model. However, existing approaches largely rely on implicit or indirect alignment signals and remain suboptimal for simultaneously supporting multimodal understanding and generation, particularly in settings that require fine-grained language-visual reasoning and controllable generation. In this work, we propose LVRPO, a language-visual reinforcement-based preference optimization framework that explicitly aligns language and visual representations using Group Relative Policy Optimization (GRPO). Instead of introducing additional alignment losses at the representation level, LVRPO directly optimizes multimodal model behaviors through preference-driven reinforcement signals, encouraging consistent and semantically grounded interactions between language and vision across both understanding and generation tasks. This formulation enables effective alignment without requiring auxiliary encoders or handcrafted cross-modal objectives, and naturally extends to diverse multimodal capabilities. Empirically, LVRPO consistently outperforms strong unified-pretraining baselines on a broad suite of benchmarks spanning multimodal understanding, generation, and reasoning.
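The abstract names GRPO as the optimization method but gives no details. As a rough illustration only, the sketch below shows the group-relative advantage step that GRPO is generally known for: several responses are sampled for the same prompt, each is scored, and each score is normalized against the group's mean and standard deviation. The function name, the `eps` constant, and the example reward values are assumptions made for illustration. How LVRPO actually defines its language-visual rewards is not stated here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar score per response sampled for the
    same prompt. The scores are hypothetical placeholders; the source
    does not specify LVRPO's reward design.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps (an assumed constant) avoids division by zero when all rewards tie
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for four sampled responses to one prompt
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean get a positive advantage and are reinforced, and those below get a negative one. Because the baseline comes from the group itself, this step needs no separate learned value model.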

Shentong Mo, Sukmin Yun • 2026

Related benchmarks

Task	Dataset	Result
Text-to-Image Generation	GenEval	Overall Score: 91	914
Mathematical Reasoning	MathVista	Score: 76.2	566
Visual Understanding	MM-Vet	MM-Vet Score: 69.5	190
Vision Understanding	MMBench	--	141
Visual Perception	MMVP	--	118
Image Editing	GEdit-Bench-EN (full)	G-Score (O): 7.67	84
Vision Understanding	MMMU	--	71
Knowledge-grounded reasoning	WISE	Overall Score: 85	68
Image Editing	GEdit-Bench-CN (Full set)	G_SC: 7.78	33
Visual Understanding	MME (total)	MME-P Score: 1.70e+3	18

