
UMM-RM: An Upcycle-and-Merge MoE Reward Model for Mitigating Reward Hacking

About

Reward models (RMs) are a critical component of reinforcement learning from human feedback (RLHF). However, conventional dense RMs are susceptible to exploitation by policy models through biases or spurious correlations, resulting in reward hacking: RM scores increase during training while alignment with human preferences deteriorates, a problem that is further exacerbated under distribution shift.

To address this issue, we propose UMM-RM (Upcycle-and-Merge MoE Reward Model). UMM-RM first upscales the feed-forward layers of a dense backbone into a mixture-of-experts (MoE) reward model with shared experts. The shared experts are always activated to capture instruction-agnostic preference signals, while the remaining experts model fine-grained preferences across instructions or task regimes. After training, the experts are consolidated into a single dense RM via learnable merging weights.

This design retains the robustness and exploitation resistance provided by expert diversity while avoiding the inference overhead of MoE architectures or explicit ensembles. Experiments across multiple base models and preference datasets show that, compared with standard dense RMs, UMM-RM improves accuracy on preference data, reduces reward hacking during PPO training, and yields more stable preference alignment.
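The upcycle-and-merge pipeline above can be sketched in weight space. This is a minimal illustration, not the paper's implementation: the dimensions, the noise-based expert initialisation, and the softmax-normalised merging coefficients are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 8, 16, 4  # toy sizes, not from the paper

# Pretrained dense FFN weights (random stand-ins for illustration).
w_in = rng.normal(size=(d_model, d_ff))
w_out = rng.normal(size=(d_ff, d_model))

# --- Upcycle: copy the dense FFN into one always-active shared expert
# plus n_experts routed experts; small noise breaks symmetry so the
# routed experts can specialise during (not shown) preference training.
shared = (w_in.copy(), w_out.copy())
experts = [(w_in + 0.01 * rng.normal(size=w_in.shape),
            w_out + 0.01 * rng.normal(size=w_out.shape))
           for _ in range(n_experts)]

# --- Merge: after training, consolidate shared + routed experts back
# into a single dense FFN via learnable merging logits (here normalised
# with a softmax so the merge is a convex combination of expert weights).
logits = rng.normal(size=n_experts + 1)        # one logit per expert
coef = np.exp(logits) / np.exp(logits).sum()

all_experts = [shared] + experts
merged_in = sum(c * e[0] for c, e in zip(coef, all_experts))
merged_out = sum(c * e[1] for c, e in zip(coef, all_experts))

# The merged model is a plain dense FFN: no routing and no extra
# inference cost relative to the original dense backbone.
assert merged_in.shape == w_in.shape and merged_out.shape == w_out.shape
```

Because the merged weights have the same shapes as the original dense FFN, the consolidated reward model can be served exactly like a standard dense RM.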

Lingling Fu, Yongfu Xue • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Preference Classification | Anthropic HH Harmless (test) | Accuracy | 58.4 | 22
Generation Quality Evaluation | AlpacaFarm | Win Rate | 36.4 | 12
Reward Modeling | Anthropic Helpful | Accuracy | 67.2 | 12
Reward Modeling | Anthropic Harmless | Accuracy | 55.2 | 12
Reward Modeling | WebGPT | Accuracy | 58.4 | 8
Preference Classification | Anthropic HH Helpful (test) | Accuracy | 57.6 | 7
Preference Classification | WebGPT comparisons (test) | Accuracy | 60.8 | 7
