KnowCoder-X: Boosting Multilingual Information Extraction via Code
About
Empirical evidence indicates that LLMs exhibit spontaneous cross-lingual alignment. However, although LLMs show promising cross-lingual alignment in Information Extraction (IE), a significant imbalance across languages persists, highlighting an underlying deficiency. To address this, we propose KnowCoder-X, a powerful code LLM with advanced cross-lingual and multilingual capabilities for universal IE. First, it standardizes the representation of multilingual schemas using Python classes, ensuring a consistent ontology across different languages, and formulates IE across languages as a unified code generation task. Second, we conduct IE cross-lingual alignment instruction tuning on the translated instance prediction task to enhance the model's cross-lingual transferability. During this phase, we also construct a high-quality and diverse bilingual IE parallel dataset with 257k samples, called ParallelNER, synthesized by our proposed robust three-stage pipeline, with manual annotation to ensure quality. Even without training on 29 unseen languages, KnowCoder-X surpasses ChatGPT by 30.17% and SoTA by 20.03%, thereby demonstrating superior cross-lingual IE capabilities. Comprehensive evaluations on 64 IE benchmarks in Chinese and English under various settings demonstrate that KnowCoder-X significantly enhances cross-lingual IE transfer by boosting IE alignment. Our code and dataset are available at: https://github.com/ICT-GoKnow/KnowCoder
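To make the idea of a Python-class schema concrete, here is a minimal sketch. It is not the authors' actual schema or prompt format: the class names `Entity`, `Person`, and `Location`, the `results = [...]` output convention, and the `parse_extraction` helper are all assumptions introduced for illustration. The point it shows is that the same class definitions can serve as the ontology for input text in any language, so extraction reduces to generating constructor calls.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical base class for every entity type in the schema."""
    name: str


class Person(Entity):
    """A human being, real or fictional."""


class Location(Entity):
    """A geographic place such as a city or country."""


# Hypothetical registry mapping class names to schema classes.
SCHEMA = {"Person": Person, "Location": Location}


def parse_extraction(code: str) -> list:
    """Turn generated code such as `results = [Person("...")]` into objects.

    The code is parsed with `ast` rather than executed, so only calls to
    known schema classes with a single string argument are accepted.
    """
    entities = []
    for node in ast.walk(ast.parse(code)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in SCHEMA
            and len(node.args) == 1
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            entities.append(SCHEMA[node.func.id](node.args[0].value))
    return entities


# Assumed model outputs for an English sentence and its Chinese translation.
en = parse_extraction('results = [Person("Marie Curie"), Location("Warsaw")]')
zh = parse_extraction('results = [Person("玛丽·居里"), Location("华沙")]')
```

Under these assumptions, `en` and `zh` contain different mention strings but the same sequence of schema types, which is the sense in which the class definitions give a language-independent ontology.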
Related benchmarks
| Task | Dataset | F1 score | Rank |
|---|---|---|---|
| Named Entity Recognition | OntoNotes | 87.91 | 91 |
| Named Entity Recognition | CoNLL 2003 | 94.69 | 86 |
| Named Entity Recognition | WNUT 2017 | 68.72 | 79 |
| Named Entity Recognition | BC5CDR | 88.46 | 59 |
| Named Entity Recognition | MIT Restaurant | -- | 50 |
| Named Entity Recognition | OntoNotes 5 | -- | 44 |
| Named Entity Recognition | ACE05 | 87.49 | 38 |
| Named Entity Recognition | GENIA | 78.97 | 37 |
| Named Entity Recognition | WikiAnn | 84.69 | 32 |
| Named Entity Recognition | MSRA | 96.01 | 29 |