
Aligning Language Model Benchmarks with Pairwise Preferences

About

Language model benchmarks are pervasive, computationally efficient proxies for real-world performance. However, many recent works find that benchmarks often fail to predict real utility. To bridge this gap, we introduce benchmark alignment, which uses limited information about model performance to automatically update offline benchmarks, producing new static benchmarks that predict model pairwise preferences in given test settings. We then propose BenchAlign, the first solution to this problem. BenchAlign learns preference-aligned weightings for benchmark questions from the question-level performance of language models together with ranked pairs of models that could be collected during deployment, producing new benchmarks that rank previously unseen models according to these preferences. Our experiments show that aligned benchmarks accurately rank unseen models according to models of human preferences, even across different model sizes, while remaining interpretable. Overall, our work provides insight into the limits of aligning benchmarks with practical human preferences, which stands to accelerate model development towards real utility.
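The abstract describes learning question weightings from model scores plus ranked model pairs. The paper's actual method is not shown on this page; below is a minimal, hypothetical sketch of the general idea, assuming a score matrix `S` (models x questions) and observed preference pairs, with a Bradley-Terry-style logistic loss on weighted score differences. All names and data here are illustrative.

```python
import numpy as np

# Hypothetical setup: S[m, q] = score of model m on benchmark question q,
# and preference pairs (i, j) meaning model i is preferred over model j.
rng = np.random.default_rng(0)
n_models, n_questions = 12, 50
S = rng.random((n_models, n_questions))
pairs = [(i, j) for i in range(n_models) for j in range(n_models)
         if i != j and S[i].mean() > S[j].mean()][:100]

# Learn nonnegative question weights w so that the weighted benchmark score
# S @ w ranks model i above model j for each observed preference pair
# (logistic / Bradley-Terry loss on per-question score differences).
w = np.ones(n_questions) / n_questions
lr = 0.5
for _ in range(500):
    grad = np.zeros(n_questions)
    for i, j in pairs:
        d = S[i] - S[j]                    # per-question score difference
        p = 1.0 / (1.0 + np.exp(-d @ w))   # P(i preferred over j) under w
        grad += (p - 1.0) * d              # gradient of -log p w.r.t. w
    w -= lr * grad / len(pairs)
    w = np.clip(w, 0.0, None)              # nonnegative weights stay interpretable
    w /= w.sum()                           # renormalize: a reweighted benchmark

aligned_scores = S @ w  # static benchmark score per model; also ranks unseen models
```

The learned `w` can then score models never seen during alignment, which is the "rank previously unseen models" property the abstract claims.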

Marco Gutierrez, Xinyi Leng, Hannah Cyberey, Jonathan Richard Schwarz, Ahmed Alaa, Thomas Hartvigsen • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Model Ranking Prediction | Helpsteer 70B+ Models Holdout (test) | Pairwise Accuracy (RM1): 77.8 | 4 |
| Model Ranking Prediction | Helpsteer 30B+ Models Holdout (test) | Pairwise Accuracy (RM1): 76.5 | 4 |
| Model Ranking Prediction | Helpsteer 13B+ Models Holdout (test) | Pairwise Accuracy (RM1-Helpful): 74.1 | 4 |
| Model Ranking Prediction | UltraFeedback 70B+ Models Holdout (test) | Pairwise Accuracy (RM1-Honest): 77.4 | 4 |
| Model Ranking Prediction | UltraFeedback 30B+ Models Holdout (test) | Pairwise Accuracy (RM1-Honest): 77.3 | 4 |
| Model Ranking Prediction | UltraFeedback 13B+ Models Holdout (test) | Pairwise Accuracy (RM1-Honest): 74.8 | 4 |
| Pairwise Preference Ranking | Helpsteer 2% holdout (test) | Pairwise Accuracy (RM1): 86.6 | 4 |
| Pairwise Preference Ranking | Helpsteer 5% holdout (test) | Pairwise Accuracy (RM1-Helpful): 84.9 | 4 |
| Pairwise Preference Ranking | Helpsteer 10% holdout (test) | Pairwise Accuracy (RM1-Helpful): 85.5 | 4 |
| Pairwise Preference Ranking | UltraFeedback 2% holdout (test) | Pairwise Accuracy (RM1-Honest): 89.1 | 4 |
(10 of 12 result rows shown.)
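The "Pairwise Accuracy" metric in the table is the fraction of held-out preference pairs whose winner the benchmark's scores reproduce. A minimal sketch of that computation (the function name and toy data are illustrative, not from the paper):

```python
def pairwise_accuracy(benchmark_scores, preferred_pairs):
    """Fraction of (preferred, other) model pairs where the benchmark agrees."""
    correct = sum(benchmark_scores[a] > benchmark_scores[b]
                  for a, b in preferred_pairs)
    return correct / len(preferred_pairs)

# Toy example: 4 models scored by a benchmark, 3 observed preferences.
scores = {"m1": 0.82, "m2": 0.74, "m3": 0.91, "m4": 0.60}
pairs = [("m3", "m1"), ("m1", "m4"), ("m2", "m3")]  # (preferred, other)
print(pairwise_accuracy(scores, pairs))  # 2 of 3 pairs agree -> 0.666...
```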
