MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages

About

Decoder-only language models can be adapted to diverse tasks through instruction finetuning, but the extent to which this generalizes at small scale for low-resource languages remains unclear. We focus on the languages of South Africa, where we are not aware of a publicly available decoder-only model that explicitly targets all eleven official written languages, nine of which are low-resource. We introduce MzansiText, a curated multilingual pretraining corpus with a reproducible filtering pipeline, and MzansiLM, a 125M-parameter language model trained from scratch. We evaluate MzansiLM on natural language understanding and generation using three adaptation regimes: monolingual task-specific finetuning, multilingual task-specific finetuning, and general multi-task instruction finetuning. Monolingual task-specific finetuning achieves strong performance on data-to-text generation, reaching 20.65 BLEU on isiXhosa and competing with encoder-decoder baselines over ten times larger. Multilingual task-specific finetuning benefits closely related languages on topic classification, achieving 78.5% macro-F1 on isiXhosa news classification. While MzansiLM adapts effectively to supervised NLU and NLG tasks, few-shot reasoning remains challenging at this model size, with performance near chance even for much larger decoder-only models. We release MzansiText and MzansiLM to provide a reproducible decoder-only baseline and clear guidance on adaptation strategies for South African languages at small scale.

Anri Lombard, Simbarashe Mawere, Temi Aina, Ethan Wolff, Sbonelo Gumede, Elan Novick, Francois Meyer, Jan Buys • 2026

Related benchmarks

Task                      Dataset                  Metric          Result  Rank
Part-of-Speech Tagging    MasakhaPOS isiXhosa      Token Accuracy  30.5    12
Part-of-Speech Tagging    MasakhaPOS isiZulu       Token Accuracy  37.8    12
Part-of-Speech Tagging    MasakhaPOS Setswana      Token Accuracy  5.6     12
Named Entity Recognition  MasakhaNER 2.0           Macro-F1 Score  26.1    11
Named Entity Recognition  MasakhaNER isiXhosa 2.0  Macro F1        48.4    11
Named Entity Recognition  MasakhaNER Setswana 2.0  Macro-F1 Score  38.9    11
Topic Classification      MasakhaNEWS isiXhosa     Macro F1        78.5    11
Topic Classification      SIB-200                  Accuracy (Xho)  40.4    11
Intent Classification     INJONGO Intent           Accuracy (Eng)  3.5     11
Topic Classification      MasakhaNEWS English      Macro-F1        63.5    11
Showing 10 of 17 rows
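
The benchmark rows report both accuracy and macro-F1, and the two can diverge sharply when classes are imbalanced. The following is a minimal sketch of how the two metrics are commonly computed. It is not the paper's evaluation code: the function names `macro_f1` and `accuracy` and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    if not labels:
        raise ValueError("macro_f1 needs at least one example")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(gold, pred):
    """Fraction of examples whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy data (hypothetical, not from the paper): a classifier that always
# predicts the majority class.
gold = ["sports", "sports", "sports", "sports", "politics", "health"]
pred = ["sports"] * 6

print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"macro-F1 = {macro_f1(gold, pred):.3f}")
```

In this toy case the majority-class predictor reaches an accuracy of 4/6 but a macro-F1 of only 0.8/3, because the two minority classes each contribute an F1 of zero. This is one reason macro-F1 is often preferred for topic classification and NER with skewed label distributions.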
