
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

About

LLM-based auto-annotators have become a key component of the LLM development process due to their cost-effectiveness and scalability compared to human-based evaluation. However, these auto-annotators can introduce biases that are hard to remove. Even simple, known confounders such as preference for longer outputs remain in existing automated evaluation metrics. We propose a simple regression analysis approach for controlling biases in auto-evaluations. As a real case study, we focus on reducing the length bias of AlpacaEval, a fast and affordable benchmark for instruction-tuned LLMs that uses LLMs to estimate response quality. Despite being highly correlated with human preferences, AlpacaEval is known to favor models that generate longer outputs. We introduce a length-controlled AlpacaEval that aims to answer the counterfactual question: "What would the preference be if the model's and baseline's output had the same length?" To achieve this, we first fit a generalized linear model to predict the biased auto-annotator's preferences based on the mediators we want to control for (length difference) and other relevant features. We then obtain length-controlled preferences by predicting preferences while conditioning the GLM with a zero difference in lengths. Length-controlling not only improves the robustness of the metric to manipulations in model verbosity, but we also find that it increases the Spearman correlation with LMSYS Chatbot Arena from 0.94 to 0.98.
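The core idea can be sketched in a few lines: fit a logistic GLM that predicts the auto-annotator's preference from the length difference (plus a model-quality intercept), then read off the counterfactual win rate by setting the length difference to zero. The sketch below is a minimal illustration on synthetic data, not the paper's actual implementation: the data-generating parameters (`true_quality`, `true_length_bias`) and the single-feature model are assumptions for demonstration, and the real method includes additional terms (e.g. instruction difficulty).

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical synthetic annotations: each record is (length_diff, preferred),
# where preferred=1 means the auto-annotator chose the model over the baseline.
# The model is both genuinely better (true_quality > 0) and more verbose,
# and the annotator is length-biased (true_length_bias > 0).
random.seed(0)
true_quality, true_length_bias = 0.4, 0.8
data = []
for _ in range(5000):
    d = random.gauss(0.5, 1.0)  # normalized (model_len - baseline_len)
    p = sigmoid(true_quality + true_length_bias * d)
    data.append((d, 1 if random.random() < p else 0))

# Fit a logistic GLM by gradient descent:
#   P(prefer model) = sigmoid(theta + gamma * length_diff)
theta, gamma, lr = 0.0, 0.0, 0.5
n = len(data)
for _ in range(500):
    g_t = g_g = 0.0
    for d, y in data:
        err = sigmoid(theta + gamma * d) - y
        g_t += err
        g_g += err * d
    theta -= lr * g_t / n
    gamma -= lr * g_g / n

raw_winrate = sum(y for _, y in data) / n
# Length-controlled win rate: condition the fitted GLM on length_diff = 0,
# answering "what would the preference be at equal lengths?"
lc_winrate = sigmoid(theta)
print(f"raw={raw_winrate:.3f} length-controlled={lc_winrate:.3f}")
```

Because the synthetic model is verbose and the annotator length-biased, the raw win rate overstates quality; the length-controlled estimate recovers something close to the quality-only preference.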

Yann Dubois, Balázs Galambosi, Percy Liang, Tatsunori B. Hashimoto • 2024

Related benchmarks

Task                                          Dataset                                 Result                        Rank
Reward Modeling                               RewardBench                             Accuracy: 84.8                70
Reward Modeling                               JudgeBench                              Accuracy: 70.1                45
Reward Modeling                               PPE Correctness                         Accuracy: 62                  33
Reward Modeling                               PPE Human                               Accuracy: 64.6                10
Reward Modeling                               RM-Bench Easy                           Accuracy: 89.8                10
Reward Modeling                               RM-Bench Normal                         Accuracy: 76.6                10
Reward Modeling                               RM-Bench Hard                           Accuracy: 0.514               10
Alignment with Human Preferences              Chatbot Arena English-only              Spearman Correlation: 82.14   9
Correlation analysis with human preferences   Chatbot Arena 15 LLMs after extension   Spearman Correlation: 0.7632  7
