PhoBERT: Pre-trained language models for Vietnamese
About
We present PhoBERT with two versions, PhoBERT-base and PhoBERT-large, the first public large-scale monolingual language models pre-trained for Vietnamese. Experimental results show that PhoBERT consistently outperforms the recent best pre-trained multilingual model XLM-R (Conneau et al., 2020) and improves the state-of-the-art in multiple Vietnamese-specific NLP tasks including Part-of-speech tagging, Dependency parsing, Named-entity recognition and Natural language inference. We release PhoBERT to facilitate future research and downstream applications for Vietnamese NLP. Our PhoBERT models are available at https://github.com/VinAIResearch/PhoBERT
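The released checkpoints can be loaded through the Hugging Face `transformers` library. Below is a minimal sketch, assuming the `vinai/phobert-base` checkpoint published by the authors and that `transformers` and `torch` are installed; note that PhoBERT expects word-segmented Vietnamese input (multi-word tokens joined by underscores, e.g. produced by an external word segmenter).

```python
# Sketch: extracting PhoBERT features with Hugging Face transformers.
# Assumes the "vinai/phobert-base" checkpoint from the authors' release.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModel.from_pretrained("vinai/phobert-base")

# Input must already be word-segmented Vietnamese.
sentence = "Chúng_tôi là những nghiên_cứu_viên ."

inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    # last_hidden_state has shape (batch, seq_len, hidden_size);
    # hidden_size is 768 for the base model.
    features = model(**inputs).last_hidden_state

print(features.shape)
```

For downstream tasks such as POS tagging or NER, these contextual embeddings are typically fed into a task-specific classification head and fine-tuned end to end.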
Dat Quoc Nguyen, Anh Tuan Nguyen • 2020
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Toxic Speech Detection | ViCTSD | Accuracy | 90.78 | 9 |
| Hate Speech Detection | ViHSD | Accuracy | 87.42 | 9 |
| Machine Reading Comprehension | UIT-ViQuAD 2.0 | EM | 57.27 | 9 |
| Natural Language Inference | ViNLI | Accuracy | 80.67 | 9 |
| Hate Spans Detection | ViHOS | Accuracy | 84.92 | 9 |
| Emotion Recognition | VSMEC | F1 | 65.44 | 8 |
| Hate Speech Detection | ViHOS | F1 | 77.16 | 8 |
| Part-of-Speech Tagging | NIIVTB POS | F1 | 79.36 | 8 |
| Named Entity Recognition | PhoNER_COVID19 (test) | Micro-F1 | 94.5 | 6 |
| Sentiment Analysis | UIT-VIFSD (test) | F1 | 77.52 | 6 |
(10 of 13 benchmark rows shown.)