
Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation

About

Natural language generation (NLG) spans a broad range of tasks, each serving specific objectives and desiring different properties of the generated text. This complexity makes automatic evaluation of NLG particularly challenging. Previous work has typically focused on a single task and developed individual evaluation metrics based on task-specific intuitions. In this paper, we propose a unifying perspective that facilitates the design of metrics for a wide range of language generation tasks and quality aspects. Based on the nature of information change from input to output, we classify NLG tasks into compression (e.g., summarization), transduction (e.g., text rewriting), and creation (e.g., dialog). The information alignment, or overlap, between input, context, and output text plays a common central role in characterizing the generation. Using the uniform concept of information alignment, we develop a family of interpretable metrics for various NLG tasks and aspects, often without the need for gold reference data. To operationalize the metrics, we train self-supervised models to approximate information alignment as a prediction task. Experiments show that the uniformly designed metrics achieve stronger or comparable correlations with human judgments than state-of-the-art metrics across diverse tasks, including text summarization, style transfer, and knowledge-grounded dialog. With information alignment as the intermediate representation, we deliver a composable library for easy NLG evaluation and future metric design.

Mingkai Deng, Bowen Tan, Zhengzhong Liu, Eric P. Xing, Zhiting Hu • 2021
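The framework's metrics compose from token-level information alignment scores. The sketch below illustrates that composition, assuming a hypothetical `align(a, b)` helper that returns a per-token score of how well each token of text `a` is supported by text `b`. Here `align` is a crude lexical-overlap placeholder, whereas the paper trains self-supervised models for this prediction task; the function names and metric compositions mirror the aspect definitions above and are illustrative, not the released library's actual API.

```python
from statistics import mean

def align(a: str, b: str) -> list[float]:
    """Placeholder alignment estimator: scores each token of `a` by
    whether it appears in `b`. The paper instead trains self-supervised
    models to predict these token-level alignment scores."""
    b_tokens = set(b.lower().split())
    return [1.0 if tok in b_tokens else 0.0 for tok in a.lower().split()]

def consistency(output: str, source: str) -> float:
    # Compression/transduction aspect: every output token should be
    # supported by the input (e.g., factual consistency in summarization).
    return mean(align(output, source))

def relevance(output: str, source: str, reference: str) -> float:
    # Output should cover the reference's information while staying
    # aligned with the source.
    return mean(align(reference, output)) * mean(align(output, source))

def groundedness(response: str, knowledge: str) -> float:
    # Creation aspect: how much of the response is supported by the
    # external knowledge context (knowledge-grounded dialog).
    return mean(align(response, knowledge))

if __name__ == "__main__":
    src = "The cat sat on the mat in the sun ."
    summ = "A cat sat on a mat ."
    print(f"consistency: {consistency(summ, src):.3f}")  # ~0.714
```

Swapping the placeholder for a trained alignment estimator recovers the intended setup: because alignment serves as the shared intermediate representation, the same composition pattern extends to other tasks and quality aspects.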

Related benchmarks

Task | Dataset | Metric | Result | Rank
Factual Consistency Evaluation | SummaC | CGS | 76.5 | 52
Factual Consistency Evaluation | QAGS-XSum | Spearman Correlation | 30.6 | 39
Factual Consistency Evaluation | QAGS-CNNDM | Spearman Correlation | 57.3 | 38
Factual Consistency Evaluation | TRUE benchmark | PAWS (AUC-ROC) | 63.1 | 37
Factual Consistency Evaluation | SummEval | Spearman Correlation | 41.7 | 36
Factual Consistency Evaluation | FRANK-XSum (FRK-X) | Spearman Correlation | 20.4 | 30
Factual Consistency Evaluation | SamSum | Spearman Correlation | 17.7 | 30
Factual Consistency Evaluation | FRANK-CNNDM | Spearman Correlation | 49.4 | 30
Factual Consistency Evaluation | SummEval (test) | Pearson Correlation (PCC) | 54.7 | 22
Factual Consistency Evaluation | FRANK-CNNDM (test) | PCC | 54.5 | 22

Showing 10 of 39 rows.
