
Calibrating LLM-Based Evaluator

About

Recent advancements in large language models (LLMs) on language modeling and emergent capabilities make them a promising reference-free evaluator of natural language generation quality, and a competent alternative to human evaluation. However, because these models are often closed-source or computationally demanding to host and tune, there is little established practice for further calibrating an off-the-shelf LLM-based evaluator towards better human alignment. In this work, we propose AutoCalibrate, a multi-stage, gradient-free approach to automatically calibrate and align an LLM-based evaluator toward human preference. Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels. Then, an initial set of scoring criteria is drafted by the language model itself, leveraging in-context learning on different few-shot examples. To further calibrate this set of criteria, we select the best performers and re-draft them with self-refinement. Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration. Our comprehensive qualitative analysis conveys insightful intuitions and observations on the essence of effective scoring criteria.
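The selection step described in the abstract (keep the drafted criteria whose scores agree best with human labels) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `score_fn` is a hypothetical stand-in for an LLM call that scores a sample under a given criterion, Pearson correlation is an assumed choice of agreement measure, and `select_top_criteria` and its parameter `k` are names introduced here for illustration only.

```python
from math import sqrt
from typing import Callable, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_top_criteria(
    criteria: Sequence[str],
    samples: Sequence[str],
    human_scores: Sequence[float],
    score_fn: Callable[[str, str], float],  # hypothetical LLM scorer (assumption)
    k: int = 2,
) -> List[str]:
    """Rank candidate criteria by agreement with human labels and keep the top k."""
    ranked = []
    for criterion in criteria:
        predicted = [score_fn(criterion, sample) for sample in samples]
        ranked.append((pearson(predicted, human_scores), criterion))
    # Highest correlation first; sorting on the correlation alone keeps ties stable.
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [criterion for _, criterion in ranked[:k]]
```

In a full pipeline the retained criteria would then be handed back to the model for re-drafting; that refinement step depends on prompt details the abstract does not specify, so it is left out here.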

Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang • 2023

Related benchmarks

Task                                     Dataset                Result                                  Rank
Social Risks (2-class) Evaluation        ValEval Disturb        Accuracy 0.8443                         16
Social Risks (2-class) Evaluation        ValEval Generalized    Accuracy 89.6                           16
Social Risks (2-class) Evaluation        ValEval (Original)     Accuracy (Social Risks 2-class) 85.2    16
Schwartz Value (3-class) Evaluation      ValEval (Original)     Accuracy (3-class) 55.53                16
Schwartz Value (3-class) Evaluation      ValEval Disturb        Accuracy 68.49                          16
Moral Foundation (3-class) Evaluation    ValEval (Original)     Accuracy 51.48                          16
Schwartz Value (3-class) Evaluation      ValEval Generalized    Accuracy 70.74                          16
Moral Foundation (3-class) Evaluation    ValEval Disturb        Accuracy 77.25                          16
Moral Foundation (3-class) Evaluation    ValEval Generalized    Accuracy 38.28                          16
Automated essay scoring                  CEAMC                  QWK 10.81                               7
Showing 10 of 16 rows
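For readers unfamiliar with the essay-scoring metric, QWK conventionally abbreviates quadratic weighted kappa, an agreement measure between two sets of ordinal ratings. The sketch below assumes that standard definition (kappa = 1 - sum(w * O) / sum(w * E), with weights w_ij = (i - j)^2 / (N - 1)^2); the page does not say how the listed value was computed or scaled, and the function name and signature are illustrative.

```python
from typing import Sequence


def quadratic_weighted_kappa(
    rater_a: Sequence[int],
    rater_b: Sequence[int],
    min_rating: int,
    max_rating: int,
) -> float:
    """Quadratic weighted kappa between two lists of integer ratings."""
    n = max_rating - min_rating + 1
    if n < 2:
        raise ValueError("need at least two distinct rating levels")
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and of equal length")

    # Observed confusion matrix: observed[i][j] counts items rated i by A and j by B.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    total = len(rater_a)
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_a[i] * hist_b[j] / total  # count expected by chance
            numerator += weight * observed[i][j]
            denominator += weight * expected

    # A zero denominator means both raters used a single identical level.
    return 1.0 - numerator / denominator if denominator else 1.0
```

Under this definition, identical rating lists give 1.0 and chance-level agreement gives a value near 0.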
