InkubaLM: A small language model for low-resource African languages
About
High-resource language models often fall short in the African context, where there is a critical need for models that are efficient, accessible, and locally relevant, even amidst significant computing and data constraints. This paper introduces InkubaLM, a small language model with 0.4 billion parameters, which achieves performance comparable to models with significantly larger parameter counts and more extensive training data on tasks such as machine translation and question answering, as well as on the AfriMMLU and AfriXnli benchmarks. Notably, InkubaLM outperforms many larger models in sentiment analysis and demonstrates remarkable consistency across multiple languages. This work represents a pivotal advancement in challenging the conventional paradigm that effective language models must rely on substantial resources. Our model and datasets are publicly available at https://huggingface.co/lelapa to encourage research and development on low-resource languages.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Part-of-Speech Tagging | MasakhaPOS isiXhosa | Token Accuracy | 0.0 | 12 |
| Part-of-Speech Tagging | MasakhaPOS isiZulu | Token Accuracy | 0.0 | 12 |
| Part-of-Speech Tagging | MasakhaPOS Setswana | Token Accuracy | 0.0 | 12 |
| Named Entity Recognition | MasakhaNER isiXhosa 2.0 | Macro F1 | 0.1 | 11 |
| Named Entity Recognition | MasakhaNER 2.0 | Macro F1 | 0.0 | 11 |
| Named Entity Recognition | MasakhaNER Setswana 2.0 | Macro F1 | 0.0 | 11 |
| Topic Classification | SIB-200 | Accuracy (Xho) | 8.4 | 11 |
| Intent Classification | INJONGO Intent | Accuracy (Eng) | 0.4 | 11 |
| Topic Classification | MasakhaNEWS English | Macro F1 | 20.3 | 11 |
| Topic Classification | MasakhaNEWS isiXhosa | Macro F1 | 7.4 | 11 |