Learning Word Vectors for 157 Languages

About

Distributed word representations, or word vectors, have recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient to the successful application of these representations is to train them on very large corpora, and use these pre-trained models in downstream tasks. In this paper, we describe how we trained such high-quality word representations for 157 languages. We used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the Common Crawl project. We also introduce three new word analogy datasets to evaluate these word vectors, for French, Hindi and Polish. Finally, we evaluate our pre-trained word vectors on 10 languages for which evaluation datasets exist, showing very strong performance compared to previous models.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, Tomas Mikolov • 2018
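
The abstract mentions evaluating word vectors on word analogy datasets. As a rough illustration of what such an evaluation involves, the sketch below scores analogy questions of the form "a is to b as c is to d" with the widely used vector-offset method: it picks the word whose vector is closest (by cosine similarity) to b - a + c. This is not necessarily the exact protocol used in the paper. The vectors, the vocabulary, and the function names (`cosine`, `solve_analogy`, `analogy_accuracy`) are invented for this example and are not taken from the released models.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# Real pre-trained word vectors have hundreds of dimensions.
TOY_VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' with the vector-offset method.

    Returns the vocabulary word (other than a, b, c) whose vector has the
    highest cosine similarity to b - a + c.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(questions, vectors):
    """Fraction of (a, b, c, d) questions for which the prediction equals d."""
    correct = sum(solve_analogy(a, b, c, vectors) == d
                  for a, b, c, d in questions)
    return correct / len(questions)


# Hypothetical two-question analogy set.
QUESTIONS = [
    ("man", "king", "woman", "queen"),
    ("king", "queen", "man", "woman"),
]

if __name__ == "__main__":
    print(analogy_accuracy(QUESTIONS, TOY_VECTORS))
```

With real vectors, `TOY_VECTORS` would be replaced by a mapping loaded from a pre-trained model file, and the question list by one of the analogy datasets; the scoring logic stays the same.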

Related benchmarks

Task	Dataset	Result
Image Retrieval	Flickr30K	R@1 35.6	170
Link Prediction	Edinburgh Association Thesaurus (EAT) (test)	Accuracy 87	44
Semantic Change Prediction	DatSemShift	Accuracy 82	44
Lexical Semantic Similarity	Multi-SimLex	Spearman Correlation 0.44	44
Semantic Textual Similarity	SICK Slovak (val)	Pearson Correlation 0.498	33
Semantic Textual Similarity	STS Benchmark Slovak (val)	Pearson Correlation 0.42	33
Caption Retrieval	Flickr30K	R@1 47.1	23
Language Modeling	LAMBADA multilingual (test)	LAMBADA Score (DE) 127.7	20
Aspect-based Sentiment Classification	19 ASC tasks averaged (test)	Accuracy 82.69	20
Language Identification	TCL UTF-8 converted	Accuracy 0.947	2

Showing 10 of 12 rows
