UMM-RM: An Upcycle-and-Merge MoE Reward Model for Mitigating Reward Hacking
About
Reward models (RMs) are a critical component of reinforcement learning from human feedback (RLHF). However, conventional dense RMs are susceptible to exploitation by policy models through biases or spurious correlations, resulting in reward hacking: RM scores increase during training while alignment with human preferences deteriorates, a problem that is further exacerbated under distribution shift.

To address this issue, we propose UMM-RM (Upcycle-and-Merge MoE Reward Model). UMM-RM first upscales the feed-forward layers of a dense backbone into a mixture-of-experts (MoE) reward model with shared experts. The shared experts are always activated to capture instruction-agnostic preference signals, while the remaining experts model fine-grained preferences across instructions or task regimes. After training, the experts are consolidated into a single dense RM via learnable merging weights.

This design retains the robustness and exploitation resistance provided by expert diversity while avoiding the inference overhead of MoE architectures or explicit ensembles. Experiments across multiple base models and preference datasets show that, compared with standard dense RMs, UMM-RM improves accuracy on preference data, reduces reward hacking during PPO training, and yields more stable preference alignment.
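The upcycle-and-merge idea can be sketched at the weight level: clone a dense feed-forward weight into several experts, then collapse them back into one dense weight through softmax-normalized learnable merging coefficients. This is a minimal illustration, not the paper's implementation; the function names (`upcycle_dense_ffn`, `merge_experts`), the perturbed-clone initialization, and the softmax parameterization of the merging weights are all assumptions.

```python
import numpy as np

def upcycle_dense_ffn(dense_w, num_experts, noise_scale=0.01, seed=0):
    """Clone a dense FFN weight into `num_experts` expert copies, each
    lightly perturbed so the experts can diverge during training.
    (Hypothetical initialization; the actual scheme may differ.)"""
    rng = np.random.default_rng(seed)
    return [dense_w + noise_scale * rng.standard_normal(dense_w.shape)
            for _ in range(num_experts)]

def merge_experts(expert_ws, merge_logits):
    """Consolidate trained expert weights into a single dense weight via a
    softmax over learnable merging logits (one scalar per expert)."""
    alphas = np.exp(merge_logits - np.max(merge_logits))  # stable softmax
    alphas /= alphas.sum()
    return sum(a * w for a, w in zip(alphas, expert_ws))

# Toy example: upcycle a 4x4 dense weight into 3 experts, then merge.
dense_w = np.ones((4, 4))
experts = upcycle_dense_ffn(dense_w, num_experts=3)
merged = merge_experts(experts, merge_logits=np.zeros(3))  # uniform merge
print(merged.shape)  # (4, 4): same shape as the original dense layer
```

Because the merged weight has the same shape as the original dense layer, the consolidated RM runs at dense-model inference cost, which is the efficiency argument made above.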
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Preference Classification | Anthropic HH Harmless (test) | Accuracy | 58.4 | 22 |
| Generation quality evaluation | AlpacaFarm | Win Rate | 36.4 | 12 |
| Reward Modeling | Anthropic Helpful | Accuracy | 67.2 | 12 |
| Reward Modeling | Anthropic Harmless | Accuracy | 55.2 | 12 |
| Reward Modeling | WebGPT | Accuracy | 58.4 | 8 |
| Preference Classification | Anthropic HH Helpful (test) | Accuracy | 57.6 | 7 |
| Preference Classification | WebGPT comparisons (test) | Accuracy | 60.8 | 7 |