When In-Distribution Gains Fail: Evaluating Weak-to-Strong Reward Models under Preference Shift

About

Weak-to-strong (W2S) generalization is a promising framework for scalable oversight, yet existing evaluations often test students under matched train-test distributions. We therefore study W2S preference learning under zero-shot distribution shift and find that strong students trained on weak preference labels can appear successful in-distribution while failing to transfer across preference datasets. We provide evidence for a representational failure mode in which weakly supervised fine-tuning can pull the strong model toward source-domain features instead of maintaining broadly transferable preference representations. To mitigate this, we propose Representation Anchoring (Anchor), a simple yet effective regularizer that constrains excessive drift from the pretrained strong model's representation space during fine-tuning, while still allowing task-relevant adaptation. Across preference domains, datasets, and model families, Anchor consistently improves out-of-distribution transfer while maintaining competitive in-distribution performance. Together, our evaluation protocol, transfer-aware metrics, and method expose hidden brittleness in current W2S reward modeling and provide a practical path toward more robust preference transfer.
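
As a rough illustration of the idea behind a representation-anchoring regularizer, the sketch below adds a drift penalty to a pairwise preference loss. The abstract does not give the actual objective, so everything here is an assumption made for illustration: the Bradley-Terry style pairwise loss, the mean-squared-distance penalty, the weight `lam`, and all function names are hypothetical and may differ from what the paper uses.

```python
import math


def preference_loss(r_chosen, r_rejected):
    """Pairwise Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected)).

    Assumed for illustration; the abstract does not specify the preference loss.
    """
    margin = r_chosen - r_rejected
    # Numerically stable softplus(-margin) = log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def anchor_penalty(h_student, h_reference):
    """Mean squared distance between the fine-tuned student's representation
    and the frozen pretrained model's representation of the same input.

    Hypothetical choice of drift measure; other distances are possible.
    """
    squared = [(a - b) ** 2 for a, b in zip(h_student, h_reference)]
    return sum(squared) / len(squared)


def anchored_loss(r_chosen, r_rejected, h_student, h_reference, lam=0.1):
    """Preference loss plus a weighted drift penalty (lam is a hypothetical weight)."""
    return preference_loss(r_chosen, r_rejected) + lam * anchor_penalty(
        h_student, h_reference
    )


if __name__ == "__main__":
    # Toy numbers: rewards for one preference pair and 2-d representations.
    loss = anchored_loss(2.0, 0.0, [1.0, 3.0], [0.0, 1.0], lam=0.1)
    print(f"anchored loss: {loss:.4f}")
```

With `lam=0` the objective reduces to the plain preference loss; larger values keep the student closer to the pretrained representation at the cost of less adaptation to the weak labels.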

Khoi Le, Tri Cao, Phong Nguyen, Cong-Duy Nguyen, Anh Tuan Luu, Miao Chunyan, See-Kiong Ng, Thong Nguyen • 2026

Related benchmarks

Task                   Dataset                    Result    Rank
Reward Model Transfer  HelpSteer3 (H3) v1 (test)  AOG 8.96  16
Reward Model Transfer  UltraFeedback (UF)         AOG 7.93  16
Reward Model Transfer  Anthropic Helpful (AH)     AOG 8.19  16
Reward Model Transfer  Anthropic Harmless (AHar)  AOG 4.62  8
Reward Model Transfer  PKU-SafeRLHF               AOG 2.46  8
Reward Model Transfer  RAIL                       AOG 4.62  8