Calibrating LLM-Based Evaluator
About
Recent advances in the language modeling and emergent capabilities of large language models (LLMs) make them promising reference-free evaluators of natural language generation quality and a competent alternative to human evaluation. However, because these models are often closed-source or computationally expensive to host and tune, there is little established practice for further calibrating an off-the-shelf LLM-based evaluator toward better human alignment. In this work, we propose AutoCalibrate, a multi-stage, gradient-free approach to automatically calibrate and align an LLM-based evaluator toward human preference. Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels. Then, an initial set of scoring criteria is drafted by the language model itself, leveraging in-context learning on different few-shot examples. To further calibrate this set of criteria, we select the best performers and re-draft them with self-refinement. Our experiments on multiple text quality evaluation datasets show a significant improvement in correlation with expert evaluation through calibration. Our comprehensive qualitative analysis offers intuitions and observations on what makes scoring criteria effective.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Social Risks (2-class) Evaluation | ValEval Disturb | Accuracy | 0.8443 | 16 |
| Social Risks (2-class) Evaluation | ValEval Generalized | Accuracy | 89.6 | 16 |
| Social Risks (2-class) Evaluation | ValEval (Original) | Accuracy (Social Risks 2-class) | 85.2 | 16 |
| Schwartz Value (3-class) Evaluation | ValEval (Original) | Accuracy (3-class) | 55.53 | 16 |
| Schwartz Value (3-class) Evaluation | ValEval Disturb | Accuracy | 68.49 | 16 |
| Moral Foundation (3-class) Evaluation | ValEval (Original) | Accuracy | 51.48 | 16 |
| Schwartz Value (3-class) Evaluation | ValEval Generalized | Accuracy | 70.74 | 16 |
| Moral Foundation (3-class) Evaluation | ValEval Disturb | Accuracy | 77.25 | 16 |
| Moral Foundation (3-class) Evaluation | ValEval Generalized | Accuracy | 38.28 | 16 |
| Automated essay scoring | CEAMC | QWK | 10.81 | 7 |