Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing
About
We introduce Trankit, a light-weight Transformer-based Toolkit for multilingual Natural Language Processing (NLP). It provides a trainable pipeline for fundamental NLP tasks in over 100 languages, and 90 pretrained pipelines for 56 languages. Built on a state-of-the-art pretrained language model, Trankit significantly outperforms prior multilingual NLP pipelines on sentence segmentation, part-of-speech tagging, morphological feature tagging, and dependency parsing, while maintaining competitive performance on tokenization, multi-word token expansion, and lemmatization across 90 Universal Dependencies treebanks. Despite its use of a large pretrained transformer, our toolkit remains efficient in memory usage and speed. This is achieved by our novel plug-and-play mechanism with Adapters, in which a single multilingual pretrained transformer is shared across the pipelines for different languages. Our toolkit, along with pretrained models and code, is publicly available at: https://github.com/nlp-uoregon/trankit. A demo website for our toolkit is also available at: http://nlp.uoregon.edu/trankit. Finally, we provide a demo video for Trankit at: https://youtu.be/q0KGP3zGjGc.
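The plug-and-play mechanism described above can be illustrated with a toy sketch. It is not Trankit's code or API: the class names (`SharedEncoder`, `Adapter`, `MultilingualPipeline`), the method names, the layer sizes, the `tanh` nonlinearity, and the random weights are all assumptions made for illustration. The point it tries to show is only that one large encoder is created once, while each language contributes a small bottleneck adapter that can be swapped in.

```python
import math
import random


def random_matrix(rows, cols, rng, scale=0.5):
    """Return a rows x cols matrix of small random weights (list of rows)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SharedEncoder:
    """Stand-in for the large multilingual transformer (hypothetical, one layer)."""

    def __init__(self, hidden, rng):
        self.weight = random_matrix(hidden, hidden, rng)

    def __call__(self, x):
        return matvec(self.weight, x)

    def num_params(self):
        return sum(len(row) for row in self.weight)


class Adapter:
    """Bottleneck module: down-project, nonlinearity, up-project, residual add."""

    def __init__(self, hidden, bottleneck, rng):
        self.down = random_matrix(bottleneck, hidden, rng)
        self.up = random_matrix(hidden, bottleneck, rng)

    def __call__(self, h):
        z = [math.tanh(v) for v in matvec(self.down, h)]
        return [a + b for a, b in zip(h, matvec(self.up, z))]

    def num_params(self):
        return sum(len(row) for row in self.down) + sum(len(row) for row in self.up)


class MultilingualPipeline:
    """Holds one shared encoder plus one small adapter per language."""

    def __init__(self, hidden=16, bottleneck=2, seed=0):
        self.rng = random.Random(seed)
        self.hidden = hidden
        self.bottleneck = bottleneck
        self.encoder = SharedEncoder(hidden, self.rng)  # created once, then reused
        self.adapters = {}
        self.active = None

    def add_language(self, language):
        # Only the small adapter is new; the encoder is left untouched.
        self.adapters[language] = Adapter(self.hidden, self.bottleneck, self.rng)
        if self.active is None:
            self.active = language

    def set_active_language(self, language):
        if language not in self.adapters:
            raise KeyError(f"no adapter loaded for {language!r}")
        self.active = language

    def encode(self, x):
        # Shared representation first, then the language-specific adapter.
        return self.adapters[self.active](self.encoder(x))


pipeline = MultilingualPipeline()
for lang in ("english", "french", "chinese"):
    pipeline.add_language(lang)

x = [1.0] * pipeline.hidden  # dummy input vector
pipeline.set_active_language("english")
out_en = pipeline.encode(x)
pipeline.set_active_language("french")
out_fr = pipeline.encode(x)

shared = pipeline.encoder.num_params()
per_language = pipeline.adapters["english"].num_params()
print(f"shared encoder parameters: {shared}")
print(f"parameters added per language: {per_language}")
```

In this toy setting, adding a language grows the model by the adapter's parameter count only, and switching languages changes the output without reloading the encoder. The real savings depend on the actual transformer and adapter sizes, which this sketch does not model.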
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Named Entity Recognition | CoNLL English 2003 (test) | F1 Score | 92.1 | 135 |
| Named Entity Recognition | CoNLL Spanish NER 2002 (test) | F1 Score | 88.9 | 98 |
| Named Entity Recognition | CoNLL Dutch 2002 (test) | F1 Score | 91.8 | 87 |
| Named Entity Recognition | CoNLL German 2003 (test) | F1 Score | 84.6 | 78 |
| Model Size Evaluation | Multilingual Language Packages | Model Size (MB) | 37.3 | 13 |
| Named Entity Recognition | English OntoNotes (test) | Entity micro-F1 | 89.6 | 7 |
| Neural Pipeline | Universal Dependencies French-GSD v2.5 (test) | Token Coverage | 99.7 | 7 |
| Named Entity Recognition | English Ontonotes NER (test) | Relative Processing Time | 1.36 | 6 |
| Neural Pipeline | Universal Dependencies Chinese-GSD 2.5 (test) | Token Accuracy | 97.01 | 5 |
| Universal Dependencies | English EWT treebank Universal Dependencies (test) | Relative Processing Time | 4.5 | 5 |