
Aligning Language Model Benchmarks with Pairwise Preferences

About

Language model benchmarks are pervasive, computationally efficient proxies for real-world performance. However, many recent works find that benchmarks often fail to predict real utility. To bridge this gap, we introduce benchmark alignment, which uses limited information about model performance to automatically update offline benchmarks, producing new static benchmarks that predict model pairwise preferences in given test settings. We then propose BenchAlign, the first solution to this problem. BenchAlign learns preference-aligned weightings for benchmark questions from the question-level performance of language models together with ranked pairs of models that could be collected during deployment, producing new benchmarks that rank previously unseen models according to these preferences. Our experiments show that aligned benchmarks accurately rank unseen models according to models of human preferences, even across different model sizes, while remaining interpretable. Overall, our work provides insight into the limits of aligning benchmarks with practical human preferences, which stands to accelerate model development towards real utility.
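The abstract describes learning question weightings from model scores plus ranked model pairs. The paper's actual method is not shown on this page; below is a minimal, hypothetical sketch of the general idea, assuming a score matrix `S` (models x questions) and observed preference pairs, with a Bradley-Terry-style logistic loss on weighted score differences. All names and data here are illustrative.

```python
import numpy as np

# Hypothetical setup: S[m, q] = score of model m on benchmark question q,
# and preference pairs (i, j) meaning model i is preferred over model j.
rng = np.random.default_rng(0)
n_models, n_questions = 12, 50
S = rng.random((n_models, n_questions))
pairs = [(i, j) for i in range(n_models) for j in range(n_models)
         if i != j and S[i].mean() > S[j].mean()][:100]

# Learn nonnegative question weights w so that the weighted benchmark score
# S @ w ranks model i above model j for each observed preference pair
# (logistic / Bradley-Terry loss on per-question score differences).
w = np.ones(n_questions) / n_questions
lr = 0.5
for _ in range(500):
    grad = np.zeros(n_questions)
    for i, j in pairs:
        d = S[i] - S[j]                    # per-question score difference
        p = 1.0 / (1.0 + np.exp(-d @ w))   # P(i preferred over j) under w
        grad += (p - 1.0) * d              # gradient of -log p w.r.t. w
    w -= lr * grad / len(pairs)
    w = np.clip(w, 0.0, None)              # nonnegative weights stay interpretable
    w /= w.sum()                           # renormalize: a reweighted benchmark

aligned_scores = S @ w  # static benchmark score per model; also ranks unseen models
```

The learned `w` can then score models never seen during alignment, which is the "rank previously unseen models" property the abstract claims.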

Marco Gutierrez, Xinyi Leng, Hannah Cyberey, Jonathan Richard Schwarz, Ahmed Alaa, Thomas Hartvigsen • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Model Ranking Prediction | Helpsteer 70B+ Models Holdout (test) | Pairwise Accuracy (RM1): 77.8 | 4 |
| Model Ranking Prediction | Helpsteer 30B+ Models Holdout (test) | Pairwise Accuracy (RM1): 76.5 | 4 |
| Model Ranking Prediction | Helpsteer 13B+ Models Holdout (test) | Pairwise Accuracy (RM1-Helpful): 74.1 | 4 |
| Model Ranking Prediction | UltraFeedback 70B+ Models Holdout (test) | Pairwise Accuracy (RM1-Honest): 77.4 | 4 |
| Model Ranking Prediction | UltraFeedback 30B+ Models Holdout (test) | Pairwise Accuracy (RM1-Honest): 77.3 | 4 |
| Model Ranking Prediction | UltraFeedback 13B+ Models Holdout (test) | Pairwise Accuracy (RM1-Honest): 74.8 | 4 |
| Pairwise Preference Ranking | Helpsteer 2% holdout (test) | Pairwise Accuracy (RM1): 86.6 | 4 |
| Pairwise Preference Ranking | Helpsteer 5% holdout (test) | Pairwise Accuracy (RM1-Helpful): 84.9 | 4 |
| Pairwise Preference Ranking | Helpsteer 10% holdout (test) | Pairwise Accuracy (RM1-Helpful): 85.5 | 4 |
| Pairwise Preference Ranking | UltraFeedback 2% holdout (test) | Pairwise Accuracy (RM1-Honest): 89.1 | 4 |
(10 of 12 result rows shown.)
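The "Pairwise Accuracy" metric in the table is the fraction of held-out preference pairs whose winner the benchmark's scores reproduce. A minimal sketch of that computation (the function name and toy data are illustrative, not from the paper):

```python
def pairwise_accuracy(benchmark_scores, preferred_pairs):
    """Fraction of (preferred, other) model pairs where the benchmark agrees."""
    correct = sum(benchmark_scores[a] > benchmark_scores[b]
                  for a, b in preferred_pairs)
    return correct / len(preferred_pairs)

# Toy example: 4 models scored by a benchmark, 3 observed preferences.
scores = {"m1": 0.82, "m2": 0.74, "m3": 0.91, "m4": 0.60}
pairs = [("m3", "m1"), ("m1", "m4"), ("m2", "m3")]  # (preferred, other)
print(pairwise_accuracy(scores, pairs))  # 2 of 3 pairs agree -> 0.666...
```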
