Optimal Transport for LLM Reward Modeling from Noisy Preference

About

Reward models are fundamental to Reinforcement Learning from Human Feedback (RLHF), yet real-world datasets are inevitably corrupted by noisy preferences. Conventional training objectives tend to overfit these errors, while existing denoising approaches often rely on homogeneous noise assumptions that fail to capture the complexity of linguistic preferences. To address these challenges, we propose SelectiveRM, a framework grounded in optimal transport. We first devise a Joint Consistency Discrepancy to align the distribution of model predictions with the preference data. Furthermore, because strict mass conservation compels the model to fit outliers, we incorporate a Mass Relaxation mechanism via partial transport. This enables the autonomous exclusion of samples whose noisy preferences contradict semantic consistency. Theoretically, we demonstrate that SelectiveRM optimizes a tighter upper bound on the unobserved clean risk. Extensive experiments validate that our approach significantly outperforms state-of-the-art baselines across diverse benchmarks.
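
To make the idea of mass relaxation concrete, here is a minimal toy sketch of partial transport. It is not the SelectiveRM objective or the Joint Consistency Discrepancy, whose details are not given on this page. It assumes uniform masses, 1-D values, and a squared-distance cost, in which case transporting a fraction k/n of the mass reduces to finding the cheapest one-to-one matching of k pairs. The function name `partial_matching` and the sample values are hypothetical, and the brute-force search is only practical for tiny inputs.

```python
from itertools import combinations, permutations


def partial_matching(source, target, k):
    """Exact partial transport for uniform masses (toy version).

    Picks the k one-to-one source/target pairs with the lowest total
    squared-distance cost. Brute force, so only for tiny inputs.
    Returns (total cost summed over pairs, list of (source_idx, target_idx)).
    """
    best_cost, best_pairs = float("inf"), []
    for rows in combinations(range(len(source)), k):
        for cols in permutations(range(len(target)), k):
            cost = sum((source[i] - target[j]) ** 2 for i, j in zip(rows, cols))
            if cost < best_cost:
                best_cost, best_pairs = cost, list(zip(rows, cols))
    return best_cost, best_pairs


# Hypothetical values: the last source point plays the role of a noisy sample.
source = [0.0, 0.1, 0.2, 5.0]
target = [0.0, 0.1, 0.2, 0.3]

# Strict mass conservation: every source point must be matched.
full_cost, full_pairs = partial_matching(source, target, k=4)

# Relaxed mass: only 3 of the 4 points (75% of the mass) must be matched.
relaxed_cost, relaxed_pairs = partial_matching(source, target, k=3)
kept = {i for i, _ in relaxed_pairs}

print("full cost:", full_cost)
print("relaxed cost:", relaxed_cost, "kept source indices:", sorted(kept))
```

With the full constraint, the outlier at 5.0 must be matched and dominates the cost. With the relaxed constraint, the cheapest plan leaves it unmatched, which mirrors the intuition of excluding samples that do not fit the rest of the data.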

Licheng Pan, Haochen Yang, Haoxuan Li, Yunsheng Lu, Yongqi Tong, Yinuo Wang, Shijian Wang, Zhixuan Chu, Lei Shen, Yuan Lu, Hao Wang • 2026

Related benchmarks

Task                    Dataset               Result                    Rank
Safety Evaluation       HarmBench             --                        148
Reward Modeling         HelpSteer (test)      MAE 0.087                 65
Reward Modeling         UltraFeedback (test)  MAE 0.145                 38
Reward Modeling         PKU-SafeRLHF (test)   MAE 0.074                 36
Safety Evaluation       DAN                   Safety Score (DAN) 0.795  26
RLHF Safety Evaluation  FFT                   SS 0.82                   8