CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation
About
Since the natural language processing (NLP) community started to make large language models (LLMs) act as critics to evaluate the quality of generated texts, most existing works train a critique generation model on evaluation data labeled via direct prompting of GPT-4. We observe that these models lack the ability to generate informative critiques in both pointwise grading and pairwise comparison, especially in reference-free settings. As a result, their generated critiques cannot provide fine-grained distinguishability among generated texts, leading to unsatisfactory evaluation performance. In this paper, we propose a simple yet effective method called Eval-Instruct, which first acquires pointwise grading critiques with pseudo references and then revises these critiques via multi-path prompting to obtain informative evaluation data for different tasks and settings, including pointwise grading and pairwise comparison with/without references. After fine-tuning on these data, the resulting model, CritiqueLLM, is empirically shown to outperform ChatGPT and all open-source baselines, and even achieves evaluation performance comparable to GPT-4 in system-level correlations of pointwise grading. We also demonstrate that our generated critiques can act as scalable feedback to further improve the generation quality of strong LLMs like ChatGPT.
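The two-step Eval-Instruct data pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's actual prompts: the `llm` callable, the prompt wording, and the setting names are all hypothetical placeholders.

```python
def generate_critique(llm, question, answer, pseudo_reference):
    """Step 1: obtain a pointwise grading critique conditioned on a pseudo reference."""
    prompt = (
        f"Question: {question}\n"
        f"Reference: {pseudo_reference}\n"
        f"Answer: {answer}\n"
        "Critique the answer against the reference and give a 1-10 score."
    )
    return llm(prompt)

def revise_for_setting(llm, critique, setting):
    """Step 2: multi-path revision, rewriting the critique for a target
    evaluation setting (e.g. reference-free grading or pairwise comparison)."""
    prompt = f"Rewrite this critique for the {setting} setting:\n{critique}"
    return llm(prompt)

def eval_instruct(llm, question, answer, pseudo_reference, settings):
    """Run step 1 once, then branch into one revised critique per setting."""
    base = generate_critique(llm, question, answer, pseudo_reference)
    return {s: revise_for_setting(llm, base, s) for s in settings}
```

The resulting per-setting critiques would then form the fine-tuning data for CritiqueLLM.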
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Pointwise Grading | AlignBench | Pearson (r) | 0.995 | 38 |
| Pairwise Comparison | AlignBench | Agreement | 70.56 | 18 |
| Text Quality Meta-evaluation | SummEval (Local) | Coherence | 0.648 | 16 |
| Text Quality Meta-evaluation | SummEval & Topical-Chat Combined (Overall) | Overall Score | 59.6 | 16 |
| Text Summarization | SummEval Global | Coherence | 71 | 16 |
| Text Quality Meta-evaluation | Topical-Chat (Local) | Understandability | 0.664 | 16 |
| Dialogue Response Generation | Topical-Chat Global | Understandability | 76.9 | 16 |
| Pairwise Comparison | AUTO-J Eval-P | Agreement | 50.93 | 10 |
| Pairwise Comparison | LLMEval | Agreement | 0.5072 | 10 |
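The table above mixes two kinds of meta-evaluation metrics: Pearson correlation between model-assigned and human scores (pointwise grading), and agreement rate between model and human preferences (pairwise comparison). A minimal sketch of both, assuming plain lists of scores and preference labels:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def agreement(predicted, gold):
    """Fraction of pairwise-comparison verdicts that match human labels."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

Whether a benchmark reports agreement as a fraction (e.g. 0.5072 on LLMEval) or a percentage (e.g. 70.56 on AlignBench) varies by leaderboard.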