Endogenous Reprompting: Self-Evolving Cognitive Alignment for Unified Multimodal Models

About

Unified Multimodal Models (UMMs) exhibit strong understanding, yet this capability often fails to effectively guide generation. We identify this as a Cognitive Gap: the model lacks the understanding of how to enhance its own generation process. To bridge this gap, we propose Endogenous Reprompting, a mechanism that transforms the model's understanding from a passive encoding process into an explicit generative reasoning step by generating self-aligned descriptors during generation. To achieve this, we introduce SEER (Self-Evolving Evaluator and Reprompter), a training framework that establishes a two-stage endogenous loop using only 300 samples from a compact proxy task, Visual Instruction Elaboration. First, Reinforcement Learning with Verifiable Rewards (RLVR) activates the model's latent evaluation ability via curriculum learning, producing a high-fidelity endogenous reward signal. Second, Reinforcement Learning with Model-rewarded Thinking (RLMT) leverages this signal to optimize the generative reasoning policy. Experiments show that SEER consistently outperforms state-of-the-art baselines in evaluation accuracy, reprompting efficiency, and generation quality, without sacrificing general multimodal capabilities.
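The two reward sources in the abstract can be contrasted with a small sketch. This is not the authors' implementation: every function name here is hypothetical, the evaluator is a toy word-overlap stand-in for the model's own learned evaluator, and the policy update that would consume these rewards is omitted. It only illustrates the assumed difference between a reward checked against a known answer (stage 1, RLVR) and a reward produced by the evaluator itself (stage 2, RLMT).

```python
from typing import Callable, List, Tuple

Evaluator = Callable[[str, str], float]


def verifiable_reward(predicted: int, gold: int) -> float:
    """Stage 1 (RLVR-style, assumed form): the evaluator's judgment is
    checked against a known label, so the reward is directly verifiable."""
    return 1.0 if predicted == gold else 0.0


def toy_evaluator(instruction: str, candidate: str) -> float:
    """Hypothetical stand-in for the trained evaluator: the fraction of
    instruction words that survive in the candidate reprompt."""
    words = instruction.lower().split()
    candidate_words = set(candidate.lower().split())
    if not words:
        return 0.0
    return sum(1 for w in words if w in candidate_words) / len(words)


def endogenous_reward(evaluator: Evaluator, instruction: str, candidate: str) -> float:
    """Stage 2 (RLMT-style, assumed form): the reward comes from the
    evaluator rather than from an external label."""
    return evaluator(instruction, candidate)


def rank_candidates(
    evaluator: Evaluator, instruction: str, candidates: List[str]
) -> List[Tuple[float, str]]:
    """Score each candidate reprompt and sort from highest to lowest reward."""
    scored = [(endogenous_reward(evaluator, instruction, c), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    ranked = rank_candidates(
        toy_evaluator,
        "a red cube",
        ["a blue sphere", "a shiny red cube on a table"],
    )
    for score, text in ranked:
        print(f"{score:.2f}  {text}")
```

In a real training loop these scores would presumably feed a reinforcement-learning update of the reprompting policy; that step depends on details the abstract does not give, so it is left out.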

Zhenchen Tang, Songlin Yang, Zichuan Wang, Bo Peng, Yang Li, Beibei Dong, Jing Dong • 2026

Related benchmarks

Task	Dataset	Result
Text-to-Image Generation	Human Evaluation Total	Win Ratio: 85	10
Visual Instruction Following	Visual Instruction Total (test)	Avg. Response Length (Words): 22.94	6
Text-to-Image Generation	Human Evaluation (In-Distribution)	Win Ratio (Overall): 85	5
Visual Instruction Following	Visual Instruction In-Distribution (test)	--	5
Visual Instruction Following	Visual Instruction In-Distribution - Simple (test)	--	5
Visual Instruction Following	Visual Instruction In-Distribution - Hard (test)	--	5
Visual Instruction Following	Visual Instruction Out-Of-Distribution (test)	--	5
Visual Instruction Following	Visual Instruction Out-Of-Distribution - Simple (test)	--	5
Visual Instruction Following	Visual Instruction Out-Of-Distribution - Hard (test)	--	5
