Endogenous Reprompting: Self-Evolving Cognitive Alignment for Unified Multimodal Models

About

Unified Multimodal Models (UMMs) exhibit strong understanding, yet this capability often fails to effectively guide generation. We identify this as a Cognitive Gap: the model lacks the understanding of how to enhance its own generation process. To bridge this gap, we propose Endogenous Reprompting, a mechanism that transforms the model's understanding from a passive encoding process into an explicit generative reasoning step by generating self-aligned descriptors during generation. To achieve this, we introduce SEER (Self-Evolving Evaluator and Reprompter), a training framework that establishes a two-stage endogenous loop using only 300 samples from a compact proxy task, Visual Instruction Elaboration. First, Reinforcement Learning with Verifiable Rewards (RLVR) activates the model's latent evaluation ability via curriculum learning, producing a high-fidelity endogenous reward signal. Second, Reinforcement Learning with Model-rewarded Thinking (RLMT) leverages this signal to optimize the generative reasoning policy. Experiments show that SEER consistently outperforms state-of-the-art baselines in evaluation accuracy, reprompting efficiency, and generation quality, without sacrificing general multimodal capabilities.

Zhenchen Tang, Songlin Yang, Zichuan Wang, Bo Peng, Yang Li, Beibei Dong, Jing Dong • 2026
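
The abstract does not spell out the inference procedure, so the following is only a minimal sketch of one plausible reading of the evaluate-then-reprompt idea: generate, score the result against the original instruction, and rewrite the prompt with richer descriptors when the score is low. Every name here (`endogenous_reprompt`, `threshold`, `max_rounds`, the `toy_*` functions, and `REQUIRED`) is hypothetical and not taken from SEER. In the paper's setting a single UMM would play all three roles; here they are toy string functions so the control flow can be run as is.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RepromptResult:
    prompt: str    # prompt that produced the best output
    output: str    # best output seen so far
    score: float   # evaluator score of that output, in [0, 1]


def endogenous_reprompt(
    instruction: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], float],
    reprompt: Callable[[str, str], str],
    threshold: float = 0.8,   # assumed stopping score, not from the paper
    max_rounds: int = 3,      # assumed reprompting budget, not from the paper
) -> RepromptResult:
    """Generate, self-evaluate, and reprompt until the score is good enough."""
    output = generate(instruction)
    best = RepromptResult(instruction, output, evaluate(instruction, output))
    for _ in range(max_rounds):
        if best.score >= threshold:
            break
        # Rewrite the prompt using the instruction and the current best output.
        prompt = reprompt(instruction, best.output)
        output = generate(prompt)
        score = evaluate(instruction, output)
        if score > best.score:
            best = RepromptResult(prompt, output, score)
    return best


# Toy stand-ins (assumptions for illustration only).
REQUIRED = ("red", "wooden", "table")  # hypothetical target descriptors


def toy_generate(prompt: str) -> str:
    # Pretend the "image" simply contains whatever the prompt mentions.
    return f"image<{prompt}>"


def toy_evaluate(instruction: str, output: str) -> float:
    # Fraction of required descriptors that appear in the output.
    return sum(word in output for word in REQUIRED) / len(REQUIRED)


def toy_reprompt(instruction: str, output: str) -> str:
    # Append the descriptors the evaluator found missing.
    missing = [word for word in REQUIRED if word not in output]
    return ", ".join([instruction, *missing])


result = endogenous_reprompt("a table", toy_generate, toy_evaluate, toy_reprompt)
```

With these toy functions the first output covers only one of the three descriptors, so one reprompting round runs and the loop then stops. Keeping the best-scoring result rather than the latest one is a design choice of this sketch, meant to ensure a poor rewrite never makes the returned output worse.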

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Image Generation | Human Evaluation Total | Win Ratio: 85 | 10 |
| Visual Instruction Following | Visual Instruction Total (test) | Avg. Response Length (Words): 22.94 | 6 |
| Text-to-Image Generation | Human Evaluation (In-Distribution) | Win Ratio (Overall): 85 | 5 |
| Visual Instruction Following | Visual Instruction In-Distribution (test) | -- | 5 |
| Visual Instruction Following | Visual Instruction In-Distribution - Simple (test) | -- | 5 |
| Visual Instruction Following | Visual Instruction In-Distribution - Hard (test) | -- | 5 |
| Visual Instruction Following | Visual Instruction Out-Of-Distribution (test) | -- | 5 |
| Visual Instruction Following | Visual Instruction Out-Of-Distribution - Simple (test) | -- | 5 |
| Visual Instruction Following | Visual Instruction Out-Of-Distribution - Hard (test) | -- | 5 |
