UMM-RM: An Upcycle-and-Merge MoE Reward Model for Mitigating Reward Hacking
About
Reward models (RMs) are a critical component of reinforcement learning from human feedback (RLHF). However, conventional dense RMs are susceptible to exploitation by policy models through biases or spurious correlations, resulting in reward hacking: RM scores increase during training while alignment with human preferences deteriorates, a problem that is further exacerbated under distribution shift.

To address this issue, we propose UMM-RM (Upcycle-and-Merge MoE Reward Model). UMM-RM first upscales the feed-forward layers of a dense backbone into a mixture-of-experts (MoE) reward model with shared experts. The shared experts are always activated to capture instruction-agnostic preference signals, while the remaining experts model fine-grained preferences across instructions or task regimes. After training, the experts are consolidated into a single dense RM via learnable merging weights.

This design retains the robustness and exploitation resistance provided by expert diversity while avoiding the inference overhead of MoE architectures or explicit ensembles. Experiments across multiple base models and preference datasets show that, compared with standard dense RMs, UMM-RM improves accuracy on preference data, reduces reward hacking during PPO training, and yields more stable preference alignment.
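The upcycle-and-merge idea can be sketched at the weight level: clone a dense feed-forward weight into several experts, then collapse them back into one dense weight through softmax-normalized learnable merging coefficients. This is a minimal illustration, not the paper's implementation; the function names (`upcycle_dense_ffn`, `merge_experts`), the perturbed-clone initialization, and the softmax parameterization of the merging weights are all assumptions.

```python
import numpy as np

def upcycle_dense_ffn(dense_w, num_experts, noise_scale=0.01, seed=0):
    """Clone a dense FFN weight into `num_experts` expert copies, each
    lightly perturbed so the experts can diverge during training.
    (Hypothetical initialization; the actual scheme may differ.)"""
    rng = np.random.default_rng(seed)
    return [dense_w + noise_scale * rng.standard_normal(dense_w.shape)
            for _ in range(num_experts)]

def merge_experts(expert_ws, merge_logits):
    """Consolidate trained expert weights into a single dense weight via a
    softmax over learnable merging logits (one scalar per expert)."""
    alphas = np.exp(merge_logits - np.max(merge_logits))  # stable softmax
    alphas /= alphas.sum()
    return sum(a * w for a, w in zip(alphas, expert_ws))

# Toy example: upcycle a 4x4 dense weight into 3 experts, then merge.
dense_w = np.ones((4, 4))
experts = upcycle_dense_ffn(dense_w, num_experts=3)
merged = merge_experts(experts, merge_logits=np.zeros(3))  # uniform merge
print(merged.shape)  # (4, 4): same shape as the original dense layer
```

Because the merged weight has the same shape as the original dense layer, the consolidated RM runs at dense-model inference cost, which is the efficiency argument made above.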
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Preference Classification | Anthropic HH Harmless (test) | Accuracy | 58.4 | 22 |
| Generation quality evaluation | AlpacaFarm | Win Rate | 36.4 | 12 |
| Reward Modeling | Anthropic Helpful | Accuracy | 67.2 | 12 |
| Reward Modeling | Anthropic Harmless | Accuracy | 55.2 | 12 |
| Reward Modeling | WebGPT | Accuracy | 58.4 | 8 |
| Preference Classification | Anthropic HH Helpful (test) | Accuracy | 57.6 | 7 |
| Preference Classification | WebGPT comparisons (test) | Accuracy | 60.8 | 7 |