XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization

About

Much recent progress in applications of machine learning models to NLP has been driven by benchmarks that evaluate models across a wide variety of tasks. However, these broad-coverage benchmarks have been mostly limited to English, and despite an increasing interest in multilingual models, a benchmark that enables the comprehensive evaluation of such methods on a diverse range of languages and tasks is still missing. To this end, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models, particularly on syntactic and sentence retrieval tasks. There is also a wide spread of results across languages. We release the benchmark to encourage research on cross-lingual learning methods that transfer linguistic knowledge across a diverse and representative set of languages and tasks.

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, Melvin Johnson • 2020

Related benchmarks

Task	Dataset	Result
Named Entity Recognition	WikiAnn (test)	--	58
Cross-lingual Language Understanding	XTREME	XNLI Accuracy: 69.1	43
Natural Language Inference	Natural Language Inference (NLI) (test)	Accuracy: 36.9	39
Named Entity Recognition	MasakhaNER 2.0	--	11
Cross-lingual Paraphrase Identification	PAWS-X	Accuracy (en): 0.931	8

