PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling

About

Reward models (RMs) are central to reinforcement learning from human feedback (RLHF), providing the critical supervision signals that align large language models (LLMs) with human preferences. Generative reward models (GRMs) provide greater interpretability than traditional scalar RMs, but they come with a critical trade-off: pairwise methods are hindered by a training-inference mismatch, while pointwise methods require expensive absolute annotations. To bridge this gap, we propose the Preference-aware Task-adaptive Reward Model (PaTaRM). Unlike prior approaches, PaTaRM enables robust pointwise training using readily available pairwise data via a novel Preference-Aware Reward (PAR) mechanism, eliminating the need for explicit rating labels. Furthermore, it incorporates a Task-Adaptive Rubric system that dynamically generates instance-specific criteria for precise evaluation. Extensive experiments demonstrate that PaTaRM achieves an 8.7% average improvement on RewardBench and RMBench across Qwen3-8B/14B models. Crucially, it yields an average relative improvement of 13.6% in downstream RLHF performance across IFEval and InFoBench, validating its effectiveness for policy alignment. Our code is available at https://github.com/JaneEyre0530/PaTaRM.
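To make the idea of pointwise training from pairwise data concrete, here is a minimal sketch of one way such a signal could be computed. It is an illustration under stated assumptions, not the PAR definition from the paper: the function name `pairwise_to_pointwise_reward`, the `margin` parameter, and the binary 1.0/0.0 reward are all hypothetical. The sketch assumes the reward model scores each response independently, and that the training reward depends only on whether those scores respect the known preference order.

```python
def pairwise_to_pointwise_reward(score_chosen, score_rejected, margin=0.0):
    """Hypothetical preference-derived reward (not the paper's PAR formula).

    score_chosen / score_rejected are pointwise scores that a reward model
    assigned independently to the preferred and dispreferred responses.
    No absolute rating labels are needed: only the pair ordering is used.
    """
    # Reward 1.0 when the chosen response outscores the rejected one
    # by more than the margin; otherwise 0.0 (ties count as failures).
    if score_chosen - score_rejected > margin:
        return 1.0
    return 0.0


if __name__ == "__main__":
    # The pointwise scores below are made-up illustrative values.
    print(pairwise_to_pointwise_reward(8.0, 5.0))
    print(pairwise_to_pointwise_reward(5.0, 8.0))
```

In this reading, the pairwise label acts only as a consistency check on two independent pointwise judgments, so inference can still score a single response at a time.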

Ai Jian, Jingqing Ruan, Xing Ma, Xiaoyun Zhang, Dailin Li, Weipeng Zhang, Ke Zeng, Xunliang Cai • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Instruction Following | IFEval | IFEval Accuracy 90.9 | 836 |
| Reward Modeling | RewardBench | Chat Score 91.5 | 216 |
| Reward Modeling | RM-Bench | -- | 137 |
| Mathematical Reasoning | GSM-8K | Accuracy 94.3 | 107 |
| Mathematical Reasoning | MATH 500 | Accuracy 95.2 | 43 |
| Information Following | InfoBench | Easy Score 89.2 | 21 |
| Reasoning | InfoBench | Easy Score 89.2 | 11 |