Dialect-robust Evaluation of Generated Text

About

Evaluation metrics that are not robust to dialect variation make it impossible to tell how well systems perform for many groups of users, and can even penalize systems for producing text in lower-resource dialects. However, there is currently no way to quantify how metrics respond to a change in the dialect of a generated utterance. We thus formalize dialect robustness and dialect awareness as goals for NLG evaluation metrics. We introduce a suite of methods and corresponding statistical tests that can be used to assess metrics against these two goals. Applying the suite to current state-of-the-art metrics, we demonstrate that they are not dialect-robust and that semantic perturbations frequently lead to smaller decreases in a metric than the introduction of dialect features. As a first step toward overcoming this limitation, we propose a training schema, NANO, which introduces regional and language information into the pretraining process of a metric. We demonstrate that NANO provides a size-efficient way for models to improve dialect robustness while simultaneously improving their performance on the standard metric benchmark.

Jiao Sun, Thibault Sellam, Elizabeth Clark, Tu Vu, Timothy Dozat, Dan Garrette, Aditya Siddhant, Jacob Eisenstein, Sebastian Gehrmann • 2022
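
The abstract's central comparison (does a metric drop more for a dialect rewrite than for a change in meaning?) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' method: `token_f1` is a toy unigram-overlap metric standing in for a real learned metric, `success_rate` counts a triple as a success only when the dialect variant scores strictly higher than the semantic perturbation, and `sign_test_p_value` is a generic one-sided exact binomial test, not necessarily the statistical test used in the paper. The example sentences are invented.

```python
from collections import Counter
from math import comb


def token_f1(reference: str, candidate: str) -> float:
    """Toy stand-in metric (assumption): unigram-overlap F1 on whitespace tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def success_rate(triples, metric=token_f1):
    """Count triples (reference, dialect_variant, semantic_perturbation) in which
    the meaning-preserving dialect variant scores strictly higher than the
    meaning-changing perturbation. Ties count as failures (assumption)."""
    wins = sum(metric(ref, dia) > metric(ref, per) for ref, dia, per in triples)
    return wins, len(triples)


def sign_test_p_value(wins: int, n: int) -> float:
    """One-sided exact binomial p-value for H0: P(success) <= 0.5."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


# Invented examples: (reference, dialect variant, semantic perturbation).
triples = [
    ("the color of the car is red",
     "the colour of the car is red",
     "the color of the truck is blue"),
    ("i am not going to the store",
     "i ain't going to the store",
     "i am not going to the school"),
    ("she put the bags in the trunk",
     "she put the bags in the boot",
     "she put the bags in the river"),
]

wins, n = success_rate(triples)
print(f"success rate: {wins}/{n}, sign-test p-value: {sign_test_p_value(wins, n):.3f}")
```

With a surface-overlap metric like this one, a dialect rewrite that touches two tokens is penalized more than a one-word change in meaning, which is the failure mode the abstract describes. Swapping a different scoring function in through the `metric` argument leaves the rest of the sketch unchanged.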

Related benchmarks

Task	Dataset	Result
Dialect Robustness	EN	Success Rate 57	11
Dialect Robustness	PT	Success Rate 82	11
Dialect Robustness	ZH	Success Rate 74	11
Quality Estimation	Portuguese (pt-BR) dialect sentences (test)	Success Rate 86	11
Quality Estimation	Mandarin (zh-CN) dialect sentences (test)	Success Rate 84	11
Segment-level agreement with human ratings	WMT 2020 (test)	Agreement (en-cs) 73	7
Quality Estimation	WMT 2020 (test)	QE Score (en-cs) 71.8	6
Reference-based Quality Estimation	Portuguese (Pt)	R_pb 0.85	5
Reference-based Quality Estimation	WMT	Overall Score (en-*) 57.6	5
Reference-based Quality Estimation	Chinese (ZH)	R_pb Score 0.84	5
