Multilingual Universal Sentence Encoder for Semantic Retrieval
About
We introduce two pre-trained retrieval-focused multilingual sentence encoding models, based respectively on the Transformer and CNN model architectures. The models embed text from 16 languages into a single semantic space using a multi-task trained dual-encoder that learns tied representations via translation-based bridge tasks (Chidambaram et al., 2018). The models provide performance that is competitive with the state-of-the-art on semantic retrieval (SR), translation pair bitext retrieval (BR), and retrieval question answering (ReQA). On English transfer learning tasks, our sentence-level embeddings approach, and in some cases exceed, the performance of monolingual, English-only sentence embedding models. Our models are made available for download on TensorFlow Hub.
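The retrieval use of a dual-encoder can be sketched as embedding queries and candidates into the shared space and ranking candidates by cosine similarity. The sketch below uses a toy character-count `embed` function purely as a stand-in for a real encoder (such as the TF Hub modules above), so it stays self-contained; `retrieve` and its signature are illustrative, not the library's API.

```python
import numpy as np

def embed(sentences):
    # Stand-in for a real sentence encoder: hashes bytes into a
    # fixed-size count vector so the example runs without the model.
    vecs = []
    for s in sentences:
        v = np.zeros(32)
        for b in s.encode("utf-8"):
            v[b % 32] += 1.0
        vecs.append(v)
    return np.stack(vecs)

def retrieve(query, candidates, k=1):
    """Rank candidates by cosine similarity to the query embedding."""
    q = embed([query])
    c = embed(candidates)
    # L2-normalize so the dot product equals cosine similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    scores = (c @ q.T).ravel()
    order = np.argsort(-scores)[:k]
    return [(candidates[i], float(scores[i])) for i in order]
```

With a real encoder, only `embed` changes; the ranking step is the same nearest-neighbor search in the shared semantic space.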
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Paraphrase Identification | TwitterPara (test) | TURL | 77.1 | 22 |
| Question Retrieval | CQADupStack (dev) | Average Precision | 0.159 | 22 |
| Question Retrieval | AskUbuntu (dev) | AP | 59.3 | 22 |
| Scientific Document Retrieval | SciDocs (dev) | Cite | 67.1 | 22 |
| Intent Detection | BANKING 10-shot (test) | Accuracy | 84.23 | 16 |
| Intent Detection | HWU 10-shot (test) | Accuracy | 83.75 | 16 |
| Intent Detection | CLINC 10-shot (test) | Accuracy | 90.85 | 16 |
| Question Response Pairing | BBAI 19 agents 1.0 (test) | Accuracy | 71.66 | 15 |
| Retrieval Question Answering | SQuAD | MRR | 62.5 | 14 |
| Sentence-level retrieval | ReQA NQ (test) | MRR | 58.2 | 13 |
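The last two rows report mean reciprocal rank (MRR), the standard metric for retrieval QA: for each query, take the reciprocal of the rank at which the first correct answer appears, then average over queries. A minimal sketch (function name and input layout are my own, not from the benchmark code):

```python
def mean_reciprocal_rank(ranked_lists, gold_answers):
    """Average of 1/rank of the first relevant result per query.

    ranked_lists: one ranked list of candidate ids per query.
    gold_answers: the correct candidate id for each query.
    Queries whose answer never appears contribute 0.
    """
    total = 0.0
    for ranking, gold in zip(ranked_lists, gold_answers):
        for rank, cand in enumerate(ranking, start=1):
            if cand == gold:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```

For example, if the correct answer is ranked 1st for one query and 2nd for another, MRR is (1 + 0.5) / 2 = 0.75.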