
P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling

About

Personalized alignment of large language models seeks to adapt responses to individual user preferences, typically via reinforcement learning. A key challenge is obtaining accurate, user-specific reward signals in open-ended scenarios. Existing personalized reward models face two persistent limitations: (1) they oversimplify diverse, scenario-specific preferences into a small, fixed set of evaluation principles, and (2) they generalize poorly to new users with limited feedback. To address these limitations, we propose P-GenRM, the first Personalized Generative Reward Model with test-time user-based scaling. P-GenRM transforms preference signals into structured evaluation chains that derive adaptive personas and scoring rubrics across scenarios. It further clusters users into User Prototypes and introduces a dual-granularity scaling mechanism: at the individual level, it adaptively scales and aggregates each user's scoring scheme; at the prototype level, it incorporates preferences from similar users. This design mitigates noise in inferred preferences and enhances generalization to unseen users through prototype-based transfer. Empirically, P-GenRM achieves state-of-the-art results on widely used personalized reward model benchmarks, with an average improvement of 2.31%, and demonstrates strong generalization on an out-of-distribution dataset. Notably, test-time user-based scaling provides an additional 3% boost, demonstrating stronger personalized alignment with test-time scalability.
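The abstract does not give the aggregation formulas, but the dual-granularity scaling mechanism it describes can be sketched as follows. All function names, the cosine-similarity prototype assignment, and the mixing weight `alpha` are assumptions for illustration, not the paper's actual implementation:

```python
import numpy as np

def assign_prototype(user_emb, prototypes):
    """Assign a user to the nearest User Prototype by cosine similarity.

    user_emb:   (d,) embedding of the user's preference profile (assumed)
    prototypes: (k, d) cluster centroids of similar users (assumed)
    Returns the index of the closest prototype and all similarities.
    """
    sims = prototypes @ user_emb / (
        np.linalg.norm(prototypes, axis=1) * np.linalg.norm(user_emb) + 1e-8
    )
    return int(np.argmax(sims)), sims

def dual_granularity_score(individual_scores, prototype_scores, alpha=0.7):
    """Combine the two granularities described in the abstract.

    individual_scores: scores from scaling/aggregating this user's own
                       scoring schemes (individual level)
    prototype_scores:  scores contributed by similar users in the same
                       prototype (prototype level)
    alpha:             hypothetical weight on the individual level
    """
    individual = float(np.mean(individual_scores))  # aggregate the user's own schemes
    prototype = float(np.mean(prototype_scores))    # borrow from similar users
    return alpha * individual + (1 - alpha) * prototype
```

Averaging over several sampled scoring schemes is one simple way to mitigate noise in inferred preferences, while the prototype term gives an unseen user with little feedback a reasonable prior transferred from similar users.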

Pinyi Zhang, Ting-En Lin, Yuchuan Wu, Jingyang Chen, Zongqi Wang, Hua Yang, Ze Xu, Fei Huang, Kai Zhang, Yongbin Li • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Personalized Reward Modeling | PRISM Personalized | Accuracy 68.06 | 44 |
| Personalized Reward Modeling | Chatbot Arena Personalized | Accuracy 75.92 | 42 |
| Personalized Reward Modeling | Lamp-QA (OOD) | Arts Score 54.3 | 7 |
| Reward Modeling | PersonalRewardBench (test) | Macro Accuracy 65.21 | 6 |
| Personalized LLM Alignment Evaluation | PersonalRewardBench (test) | -- | 6 |

Other info

GitHub
