Language-agnostic BERT Sentence Embedding

About

While BERT is an effective method for learning monolingual sentence embeddings for semantic similarity and embedding based transfer learning (Reimers and Gurevych, 2019), BERT based cross-lingual sentence embeddings have yet to be explored. We systematically investigate methods for learning multilingual sentence embeddings by combining the best methods for learning monolingual and cross-lingual representations including: masked language modeling (MLM), translation language modeling (TLM) (Conneau and Lample, 2019), dual encoder translation ranking (Guo et al., 2018), and additive margin softmax (Yang et al., 2019a). We show that introducing a pre-trained multilingual language model dramatically reduces the amount of parallel training data required to achieve good performance by 80%. Composing the best of these methods produces a model that achieves 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba, well above the 65.5% achieved by Artetxe and Schwenk (2019b), while still performing competitively on monolingual transfer learning benchmarks (Conneau and Kiela, 2018). Parallel data mined from CommonCrawl using our best model is shown to train competitive NMT models for en-zh and en-de. We publicly release our best multilingual sentence embedding model for 109+ languages at https://tfhub.dev/google/LaBSE.
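The abstract names dual encoder translation ranking with additive margin softmax as one ingredient of the model. A minimal sketch of how such an in-batch ranking loss can be computed is given below. It is not the authors' released implementation: the function name `additive_margin_ranking_loss`, the use of cosine similarity, and the margin value of 0.3 are assumptions made for illustration only.

```python
import numpy as np


def additive_margin_ranking_loss(src, tgt, margin=0.3):
    """In-batch bidirectional translation ranking loss with an additive margin.

    src, tgt: arrays of shape (N, d); row i of src is assumed to be a
    translation of row i of tgt. The default margin is a hypothetical value.
    """
    # L2-normalize so that the dot product equals cosine similarity (assumption).
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)

    sim = src @ tgt.T  # (N, N) matrix of pairwise similarities
    # Subtract the margin from the true-pair scores, which sit on the diagonal.
    logits = sim - margin * np.eye(len(src))

    def row_loss(z):
        # Numerically stable softmax cross-entropy with the diagonal as target.
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Rank targets given sources, and sources given targets.
    return row_loss(logits) + row_loss(logits.T)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.normal(size=(8, 16))
    target = source + 0.01 * rng.normal(size=(8, 16))  # near-perfect "translations"
    print("aligned loss:   ", additive_margin_ranking_loss(source, target))
    print("misaligned loss:", additive_margin_ranking_loss(source, target[::-1]))
```

Every other sentence in the batch acts as a negative for each pair, and the margin makes the true pair harder to rank first, which pushes translations closer together than a plain softmax would.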

Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, Wei Wang • 2020

Related benchmarks

Task	Dataset	Result
Sentence Embedding Evaluation	MTEB (test)	Classification Score 61.502	55
Multimodal Retrieval	Multi30K (test)	--	35
Value identification	WVS	Accuracy 9.97	16
Value identification	PVQ-RR	Accuracy 0.1155	16
Value identification	GLOBE	Accuracy 9.31	16
Value identification	ValuePrism	Accuracy 16.2	16
Clustering	MultiClaim (test)	ARI 51.1	15
Clustering	ClaimMatch	ARI 0.456	15
Clustering	ClaimCheck	ARI 0.751	15
Cross-lingual Semantic Similarity	XL (test)	Spearman's rho 72.4	12

Showing 10 of 68 rows
