Endogenous Reprompting: Self-Evolving Cognitive Alignment for Unified Multimodal Models
About
Unified Multimodal Models (UMMs) exhibit strong understanding, yet this capability often fails to effectively guide generation. We identify this as a Cognitive Gap: the model lacks the understanding of how to enhance its own generation process. To bridge this gap, we propose Endogenous Reprompting, a mechanism that transforms the model's understanding from a passive encoding process into an explicit generative reasoning step by generating self-aligned descriptors during generation. To achieve this, we introduce SEER (Self-Evolving Evaluator and Reprompter), a training framework that establishes a two-stage endogenous loop using only 300 samples from a compact proxy task, Visual Instruction Elaboration. First, Reinforcement Learning with Verifiable Rewards (RLVR) activates the model's latent evaluation ability via curriculum learning, producing a high-fidelity endogenous reward signal. Second, Reinforcement Learning with Model-rewarded Thinking (RLMT) leverages this signal to optimize the generative reasoning policy. Experiments show that SEER consistently outperforms state-of-the-art baselines in evaluation accuracy, reprompting efficiency, and generation quality, without sacrificing general multimodal capabilities.
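To make the two reward signals described above more concrete, here is a minimal sketch of how they differ. It is an illustration only, not the SEER implementation: the function names `rlvr_reward`, `rlmt_reward`, and `toy_evaluator` are hypothetical, and the toy scoring rule is an assumption chosen so the example runs without a model. In stage one the reward is verifiable against a known label; in stage two the reward comes from the model's own evaluator.

```python
from typing import Callable


def rlvr_reward(predicted: str, reference: str) -> float:
    """Stage 1 sketch (verifiable reward): 1.0 if the evaluator's verdict
    matches a known reference label, else 0.0."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def rlmt_reward(
    evaluator: Callable[[str, str], float],
    instruction: str,
    reprompt: str,
) -> float:
    """Stage 2 sketch (model reward): score a reprompt with the evaluator
    and clip the result to [0, 1]."""
    return max(0.0, min(1.0, evaluator(instruction, reprompt)))


def toy_evaluator(instruction: str, reprompt: str) -> float:
    """Hypothetical stand-in for the learned evaluator (assumption): favors
    reprompts that keep the instruction's words and add new detail."""
    src = set(instruction.lower().split())
    out = set(reprompt.lower().split())
    if not src:
        return 0.0
    coverage = len(src & out) / len(src)   # how much of the instruction survives
    added = min(len(out - src), 10) / 10   # new descriptors, capped at 10 words
    return 0.5 * coverage + 0.5 * added


if __name__ == "__main__":
    instruction = "a red cube"
    plain = "a red cube"
    elaborated = "a glossy red cube on a wooden table under soft light"
    print(rlmt_reward(toy_evaluator, instruction, plain))
    print(rlmt_reward(toy_evaluator, instruction, elaborated))
```

In an actual training loop the evaluator would be the model itself after stage one, and the stage-two reward would drive a policy-gradient update of the reprompting policy; neither is shown here.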
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Image Generation | Human Evaluation Total | Win Ratio: 85 | 10 |
| Visual Instruction Following | Visual Instruction Total (test) | Avg. Response Length (Words): 22.94 | 6 |
| Text-to-Image Generation | Human Evaluation (In-Distribution) | Win Ratio (Overall): 85 | 5 |
| Visual Instruction Following | Visual Instruction In-Distribution (test) | -- | 5 |
| Visual Instruction Following | Visual Instruction In-Distribution - Simple (test) | -- | 5 |
| Visual Instruction Following | Visual Instruction In-Distribution - Hard (test) | -- | 5 |
| Visual Instruction Following | Visual Instruction Out-Of-Distribution (test) | -- | 5 |
| Visual Instruction Following | Visual Instruction Out-Of-Distribution - Simple (test) | -- | 5 |
| Visual Instruction Following | Visual Instruction Out-Of-Distribution - Hard (test) | -- | 5 |