
GPTScore: Evaluate as You Desire

About

Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that can produce high-quality text, images, and other outputs by leveraging large pre-trained models. Nevertheless, assessing the quality of generated output is an even harder task than generation itself, and this issue has not received adequate attention recently. This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities (e.g., zero-shot instruction following) of generative pre-trained models to score generated texts. The paper explores 19 pre-trained models, ranging in size from 80M parameters (e.g., FLAN-T5-small) to 175B (e.g., GPT3). Experimental results on four text generation tasks, 22 evaluation aspects, and 37 corresponding datasets demonstrate that this approach lets one evaluate texts for whatever aspect one desires simply through natural language instructions. This property helps overcome a long-standing challenge in text evaluation: how to achieve customized, multi-faceted evaluation without annotated samples. The code is publicly available at https://github.com/jinlanfu/GPTScore.
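The core idea can be sketched in a few lines: score a hypothesis text as the (weighted) average log-probability its tokens receive from a generative LM, conditioned on a natural-language instruction describing the aspect to evaluate. The sketch below is illustrative, not the paper's implementation (see the linked repo for that); `toy_logprob` is a hypothetical stand-in for a real language model's token log-probabilities.

```python
import math
from typing import Callable, List

def gptscore(prompt: str, hypothesis: List[str],
             token_logprob: Callable[[str, List[str], str], float]) -> float:
    """GPTScore-style evaluation: average log-probability of each
    hypothesis token, conditioned on the instruction prompt and the
    tokens that precede it (uniform token weights assumed here)."""
    if not hypothesis:
        return 0.0
    total = sum(
        token_logprob(prompt, hypothesis[:i], tok)
        for i, tok in enumerate(hypothesis)
    )
    return total / len(hypothesis)

# Hypothetical stand-in for an LM: every token is equally likely
# over a 10-word vocabulary, regardless of context.
def toy_logprob(prompt: str, prefix: List[str], token: str) -> float:
    return math.log(1 / 10)

prompt = "Evaluate the fluency of the following summary:"
score = gptscore(prompt, "the cat sat on the mat".split(), toy_logprob)
```

In practice `token_logprob` would query a model such as FLAN-T5 or GPT3 with the instruction (and any source text) as the conditioning context; higher average log-probability under an aspect-specific instruction is taken as a higher score on that aspect.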

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, Pengfei Liu• 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Summarization Evaluation | SummEval | Avg Spearman Rho | 0.394 | 40 |
| Factual Consistency Evaluation | QAGS XSUM | Spearman Correlation | 22 | 39 |
| Factual Consistency Evaluation | SummEval | Spearman Correlation | 0.475 | 36 |
| Quantitative evaluation of LLM feedback against human gold standards | 50 SOC analysis reports (test) | Spearman Correlation (ρ) | 0.65 | 30 |
| Dialogue Evaluation Human Correlation | Topical-Chat | Naturalness Pearson (r) | 0.353 | 26 |
| Meta-evaluation | SummEval | -- | -- | 10 |
| Patent Quality Evaluation | Pap2Pat EvalGold N=146 (test) | TCF | 38 | 8 |
