BLEURT: Learning Robust Metrics for Text Generation

About

Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate poorly with human judgments. We propose BLEURT, a learned evaluation metric based on BERT that can model human judgments with a few thousand possibly biased training examples. A key aspect of our approach is a novel pre-training scheme that uses millions of synthetic examples to help the model generalize. BLEURT provides state-of-the-art results on the last three years of the WMT Metrics shared task and the WebNLG Competition dataset. In contrast to a vanilla BERT-based approach, it yields superior results even when the training data is scarce and out-of-distribution.

Thibault Sellam, Dipanjan Das, Ankur P. Parikh • 2020

Related benchmarks

Task	Dataset	Metric	Result
Factual Consistency Evaluation	SummaC	CGS	60.8	52
Summarization Evaluation	SummEval	Coherence	53.3	41
Factual Consistency Evaluation	QAGS XSUM	Spearman Correlation	12.4	39
Factual Consistency Evaluation	QAGS CNNDM	Spearman Correlation	43.4	38
Factual Consistency Evaluation	TRUE benchmark	PAWS (AUC-ROC)	68.4	37
Factual Consistency Evaluation	SummEval	Spearman Correlation	23.6	36
Machine Translation Meta-evaluation	WMT Metrics Shared Task Segment-level 2023 (Primary submissions)	Avg Correlation	0.622	33
Factual Consistency Evaluation	FRANK-XSum (FRK-X)	Spearman Correlation	13.9	30
Machine Translation Meta-evaluation	MENT ZH-EN	Meta Score	56.5	30
Machine Translation Meta-evaluation	MENT EN-ZH	Meta Score	56.5	30

Showing 10 of 78 rows
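
Several rows above report a Spearman correlation between a metric's scores and human judgments. As a rough illustration of how such a number can be computed, here is a minimal sketch in plain Python. The function names and the sample data are hypothetical and do not come from the paper or from any benchmark listed here. The table's values also appear to be scaled by 100, which is an assumption on our part.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over all items tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores, human_ratings):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_ratings):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(metric_scores), average_ranks(human_ratings))


# Hypothetical example: one metric score and one human rating per output.
metric_scores = [0.12, 0.55, 0.31, 0.90, 0.47]
human_ratings = [1, 4, 2, 5, 3]
rho = spearman(metric_scores, human_ratings)
print(f"Spearman correlation: {rho:.3f}")
```

Because only the rank order matters, a metric can score well here even if its raw values sit on a very different scale from the human ratings. In practice a library routine such as `scipy.stats.spearmanr` would normally be used instead of hand-written code.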
