When In-Distribution Gains Fail: Evaluating Weak-to-Strong Reward Models under Preference Shift
About
Weak-to-strong (W2S) generalization is a promising framework for scalable oversight, yet existing evaluations often test students under matched train-test distributions. Therefore, we study W2S preference learning under zero-shot distribution shift and find that strong students trained on weak preference labels can appear successful in-distribution while failing to transfer across preference datasets. We provide evidence for a representational failure mode in which weak-supervised fine-tuning can pull the strong model toward source-domain features instead of maintaining broadly transferable preference representations. To mitigate this, we propose Representation Anchoring (Anchor), a simple yet effective regularizer that constrains excessive drift from the pretrained strong model's representation space during fine-tuning, while still allowing task-relevant adaptation. Across preference domains, datasets, and model families, Anchor consistently improves out-of-distribution transfer while maintaining competitive in-distribution performance. Together, our evaluation protocol, transfer-aware metrics, and method expose hidden brittleness in current W2S reward modeling and provide a practical path toward more robust preference transfer.
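The abstract does not give the exact form of the Anchor regularizer. As a rough illustration only, the sketch below assumes one plausible form: a Bradley-Terry pairwise preference loss on the weak labels, plus a squared-distance penalty between the student's current features and the features of the frozen pretrained model on the same input. The function names, the mean-squared distance, and the `anchor_weight` default are all hypothetical choices made for this example and are not taken from the paper.

```python
import math
from typing import Sequence


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def anchor_penalty(features: Sequence[float],
                   pretrained_features: Sequence[float]) -> float:
    """Mean squared distance between current and frozen pretrained features.

    Assumed form of the drift penalty; the paper may use a different distance.
    """
    if len(features) != len(pretrained_features) or len(features) == 0:
        raise ValueError("feature vectors must be non-empty and equal in length")
    squared = sum((a - b) ** 2 for a, b in zip(features, pretrained_features))
    return squared / len(features)


def anchored_loss(reward_chosen: float,
                  reward_rejected: float,
                  features: Sequence[float],
                  pretrained_features: Sequence[float],
                  anchor_weight: float = 0.1) -> float:
    """Preference loss plus a weighted representation-drift penalty.

    anchor_weight is a hypothetical hyperparameter trading off adaptation
    to the weak labels against staying near the pretrained representation.
    """
    task = bradley_terry_loss(reward_chosen, reward_rejected)
    drift = anchor_penalty(features, pretrained_features)
    return task + anchor_weight * drift
```

Under these assumptions, setting `anchor_weight` to zero recovers plain weak-supervised fine-tuning, while larger values keep the student closer to its pretrained representation at the cost of fitting the weak labels less tightly.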
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Reward Model Transfer | HelpSteer3 (H3) v1 (test) | AOG 8.96 | 16 |
| Reward Model Transfer | UltraFeedback (UF) | AOG 7.93 | 16 |
| Reward Model Transfer | Anthropic Helpful (AH) | AOG 8.19 | 16 |
| Reward Model Transfer | Anthropic Harmless (AHar) | AOG 4.62 | 8 |
| Reward Model Transfer | PKU-SafeRLHF | AOG 2.46 | 8 |
| Reward Model Transfer | RAIL | AOG 4.62 | 8 |