Entity Projection via Machine Translation for Cross-Lingual NER
About
Although over 100 languages are supported by strong off-the-shelf machine translation systems, only a subset of them possess large annotated corpora for named entity recognition. Motivated by this fact, we leverage machine translation to improve annotation-projection approaches to cross-lingual named entity recognition. We propose a system that improves over prior entity-projection methods by: (a) leveraging machine translation systems twice: first for translating sentences and subsequently for translating entities; (b) matching entities based on orthographic and phonetic similarity; and (c) identifying matches based on distributional statistics derived from the dataset. Our approach improves upon current state-of-the-art methods for cross-lingual named entity recognition on 5 diverse languages by an average of 4.1 points. Further, our method achieves state-of-the-art F_1 scores for Armenian, outperforming even a monolingual model trained on Armenian source data.
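To make step (b) concrete, the following is a minimal sketch of orthographic matching, not the authors' implementation. It assumes the sentence has already been machine-translated into the target language and that each source entity has been translated separately, as in step (a). It then searches the target sentence for the token span whose surface form is closest to the translated entity. The function names, the `max_len` window, the `0.75` threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all illustrative assumptions. Phonetic matching and the distributional statistics of step (c) are omitted.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for an orthographic score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_span(target_tokens, entity_translation, max_len=4, threshold=0.75):
    """Return (start, end, score) for the token span most similar to the
    translated entity, or None if no span reaches the (assumed) threshold."""
    best = None
    for start in range(len(target_tokens)):
        last = min(start + max_len, len(target_tokens))
        for end in range(start + 1, last + 1):
            candidate = " ".join(target_tokens[start:end])
            score = similarity(candidate, entity_translation)
            if score >= threshold and (best is None or score > best[2]):
                best = (start, end, score)
    return best


def project(target_tokens, translated_entities):
    """Assign BIO tags to target tokens from (translated_entity, label) pairs."""
    tags = ["O"] * len(target_tokens)
    for text, label in translated_entities:
        span = best_span(target_tokens, text)
        if span is None:
            continue  # no sufficiently similar span; leave tokens untagged
        start, end, _ = span
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Hypothetical example: a German target sentence and two translated entities.
tokens = ["Angela", "Merkel", "besuchte", "Paris"]
entities = [("Angela Merkel", "PER"), ("Paris", "LOC")]
tags = project(tokens, entities)
```

A fuzzy score with a threshold is used here instead of exact string equality because a translated entity often differs slightly from its form in the translated sentence, for example through inflection or transliteration variants.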
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Named Entity Recognition | CoNLL Spanish NER 2002 (test) | F1 Score: 73.5 | 98 |
| Named Entity Recognition | CoNLL Dutch 2002 (test) | F1 Score: 69.9 | 87 |
| Named Entity Recognition | CoNLL NER 2002/2003 (test) | German F1 Score: 61.5 | 59 |
| Named Entity Recognition | CoNLL (test) | F1 Score (de): 61.5 | 28 |
| Named Entity Recognition | Spanish (test) | F1 Score: 73.5 | 15 |
| Named Entity Recognition | Dutch (test) | F1 Score: 69.9 | 15 |
| Named Entity Recognition | CoNLL de 2003 (test) | F1 Score: 61.5 | 12 |
| Named Entity Recognition | German (test) | F1 Score: 61.5 | 9 |
| Named Entity Recognition | Chinese (test) | F1 Score: 50.1 | 4 |
| Named Entity Recognition | Hindi (test) | F1 Score: 41.7 | 4 |