When In-Distribution Gains Fail: Evaluating Weak-to-Strong Reward Models under Preference Shift

About

Weak-to-strong (W2S) generalization is a promising framework for scalable oversight, yet existing evaluations often test students under matched train-test distributions. We therefore study W2S preference learning under zero-shot distribution shift and find that strong students trained on weak preference labels can appear successful in-distribution while failing to transfer across preference datasets. We provide evidence for a representational failure mode in which weakly supervised fine-tuning can pull the strong model toward source-domain features instead of maintaining broadly transferable preference representations. To mitigate this, we propose Representation Anchoring (Anchor), a simple yet effective regularizer that constrains excessive drift from the pretrained strong model's representation space during fine-tuning, while still allowing task-relevant adaptation. Across preference domains, datasets, and model families, Anchor consistently improves out-of-distribution transfer while maintaining competitive in-distribution performance. Together, our evaluation protocol, transfer-aware metrics, and method expose hidden brittleness in current W2S reward modeling and provide a practical path toward more robust preference transfer.
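
To make the idea of an anchoring regularizer concrete, here is a minimal sketch of how such a penalty could be combined with a pairwise preference loss. The abstract does not give the exact form of Anchor, so everything below is an assumption for illustration: the Bradley-Terry style loss, the mean squared distance between fine-tuned and frozen pretrained features, the weight `lam`, and all function names are hypothetical and are not taken from the paper.

```python
import math


def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise preference loss -log(sigmoid(r_chosen - r_rejected)).

    Computed as softplus(-margin) in a numerically stable way.
    """
    margin = r_chosen - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def anchor_penalty(h_student, h_pretrained):
    """Mean squared distance between the fine-tuned features and the
    frozen pretrained features (an assumed form of the regularizer)."""
    diffs = [(a - b) ** 2 for a, b in zip(h_student, h_pretrained)]
    return sum(diffs) / len(diffs)


def anchored_loss(r_chosen, r_rejected, h_student, h_pretrained, lam=0.1):
    """Preference loss plus a weighted drift penalty.

    `lam` (hypothetical) trades off fitting the weak labels against
    staying close to the pretrained representation space.
    """
    task = bradley_terry_loss(r_chosen, r_rejected)
    drift = anchor_penalty(h_student, h_pretrained)
    return task + lam * drift


if __name__ == "__main__":
    # Same reward margin, with and without representation drift.
    no_drift = anchored_loss(1.0, 0.0, [0.5, 0.5], [0.5, 0.5])
    with_drift = anchored_loss(1.0, 0.0, [1.5, -0.5], [0.5, 0.5])
    print(no_drift, with_drift)
```

In this sketch, setting `lam` to zero recovers plain fine-tuning on the weak labels, while larger values keep the student's features closer to the pretrained ones at the cost of less adaptation.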

Khoi Le, Tri Cao, Phong Nguyen, Cong-Duy Nguyen, Anh Tuan Luu, Miao Chunyan, See-Kiong Ng, Thong Nguyen • 2026

Related benchmarks

Task	Dataset	Result
Reward Model Transfer	HelpSteer3 (H3) v1 (test)	AOG 8.96	16
Reward Model Transfer	UltraFeedback (UF)	AOG 7.93	16
Reward Model Transfer	Anthropic Helpful (AH)	AOG 8.19	16
Reward Model Transfer	Anthropic Harmless (AHar)	AOG 4.62	8
Reward Model Transfer	PKU-SafeRLHF	AOG 2.46	8
Reward Model Transfer	RAIL	AOG 4.62	8
