
Unified Personalized Reward Model for Vision Generation

About

Recent advances in multimodal reward models (RMs) have significantly propelled the development of visual generation. Existing frameworks typically adopt Bradley-Terry-style preference modeling or leverage generative VLMs as judges, and subsequently optimize visual generation models via reinforcement learning. However, current RMs suffer from inherent limitations: they often follow a one-size-fits-all paradigm that assumes a monolithic preference distribution or relies on fixed evaluation rubrics. As a result, they are insensitive to content-specific visual cues, leading to systematic misalignment with subjective and context-dependent human preferences. To address this, inspired by how humans assess visual content, we propose UnifiedReward-Flex, a unified personalized reward model for vision generation that couples reward modeling with flexible, context-adaptive reasoning. Specifically, given a prompt and the generated visual content, it first interprets the semantic intent and grounds its judgment in visual evidence, then dynamically constructs a hierarchical assessment by instantiating fine-grained criteria under both predefined and self-generated high-level dimensions. Our training pipeline follows a two-stage process: (1) we first distill structured, high-quality reasoning traces from advanced closed-source VLMs to bootstrap supervised fine-tuning (SFT), equipping the model with flexible, context-adaptive reasoning behaviors; (2) we then perform direct preference optimization (DPO) on carefully curated preference pairs to further strengthen reasoning fidelity and discriminative alignment. To validate its effectiveness, we integrate UnifiedReward-Flex into the GRPO framework for image and video synthesis, and extensive results demonstrate its superiority.
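The second training stage uses the standard DPO objective on preference pairs. The sketch below is a minimal, illustrative implementation of that published loss in plain Python, not code from this paper; the function name and arguments are our own, and in practice the log-probabilities would come from the policy and a frozen reference model.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    logp_w / logp_l       : policy log-probs of the chosen / rejected response
    ref_logp_w / ref_logp_l: reference-model log-probs of the same responses
    beta                  : strength of the implicit KL constraint

    The loss is -log sigmoid(beta * (policy margin - reference margin)),
    pushing the policy to widen the chosen-vs-rejected gap beyond the
    reference model's gap.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At zero margin the loss equals log(2); it decreases as the policy
# prefers the chosen response more strongly than the reference does.
```

Minimizing this over curated pairs is what the abstract refers to as strengthening "reasoning fidelity and discriminative alignment": the reward model's preferred reasoning trace is treated as the chosen response.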

Yibin Wang, Yuhang Zang, Feng Han, Jiazi Bu, Yujie Zhou, Cheng Jin, Jiaqi Wang • 2026

Related benchmarks

Task                        | Dataset                      | Result                     | Rank
Text-to-Video Generation    | VBench                       | --                         | 111
Image Generation Assessment | GenAI-Bench Image (test)     | Accuracy 73.4              | 8
Image Generation Assessment | MMRB2 (test)                 | Accuracy 69.2              | 8
Video Generation Assessment | GenAI-Bench Video (test)     | Accuracy 82.5              | 8
Video Generation Assessment | MJBench (test)               | Accuracy 72                | 8
Semantic Consistency        | UniGenBench In-domain v1     | Overall Score 73.95        | 7
Text-to-Image Generation    | UniGenBench++ in-domain      | Semantic Consistency 73.95 | 7
Text-to-Image Generation    | T2I-CompBench out-of-domain  | Semantic Consistency 51.37 | 7
Text-to-Image Generation    | GenEval out-of-domain        | Semantic Consistency 69.62 | 7
Text-to-Image Generation    | Out-of-Domain Evaluation Set | CLIP Score 36.25           | 7

Other info

GitHub
