Kakugo: Distillation of Low-Resource Languages into Small Language Models
About
We present Kakugo, a novel and cost-effective pipeline designed to train general-purpose Small Language Models (SLMs) for low-resource languages using only the language name as input. By using a large teacher model to generate synthetic prompts and translate instruction datasets, we produced training data and SLMs for 54 low-resource languages. Evaluations across a diverse set of general natural language processing tasks, including translation, classification, and question answering, demonstrate that our pipeline consistently improves performance over base models. With a total generation and training cost of under $50 per language, Kakugo offers an accessible method for communities to develop language-specific AI.
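The two data-generation routes described above (teacher-written synthetic prompts and translation of an existing instruction dataset) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `teacher` callable, function names, and prompt wording are all assumptions standing in for calls to a large teacher model.

```python
# Hedged sketch of a Kakugo-style data pipeline. The `teacher` argument is a
# stand-in for a large teacher LLM; all names and prompts are illustrative.

def build_training_data(language, seed_pairs, teacher):
    """Produce (prompt, response) pairs in `language` via two routes:
    1. a synthetic prompt generated directly in the target language,
    2. translation of existing English instruction/response pairs."""
    pairs = []

    # Route 1: ask the teacher to write a prompt in the target language,
    # then have it answer its own prompt.
    synthetic_prompt = teacher(f"Write an instruction in {language}.")
    pairs.append((synthetic_prompt, teacher(synthetic_prompt)))

    # Route 2: translate existing English instruction data.
    for instruction, response in seed_pairs:
        pairs.append((
            teacher(f"Translate to {language}: {instruction}"),
            teacher(f"Translate to {language}: {response}"),
        ))
    return pairs

# Stub teacher for demonstration: tags its input instead of calling an LLM.
stub = lambda text: f"[teacher] {text}"
data = build_training_data("Yoruba", [("Say hello.", "Hello!")], stub)
```

The resulting pairs would then serve as supervised fine-tuning data for the SLM; in practice the teacher calls would go to a large model API and cover many seed instructions per language.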
Peter Devine, Mardhiyah Sanni, Farid Adilazuarda, Julieta Gil Loizaga, Barry Haddow • 2026
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Instruction following and reasoning | Low-resource languages evaluation suite (am, arz, ars, as, ast, az, ba, bn, bo, ceb, cv, cy, fo, ga, gd, gl, gn, ha, ht, ig, jv, kmr, sdh, ky, lb, lo, lus, mg, mi, mn, mt, ny, oc, pap, ps, rn, rw, sd, si, sm, sn, st, su, sw, te, tg, ti, tk, tt, ug, xh, yi, yo, zu) | Wins: 5 | 54 |
| Machine Translation | FLORES xx→en (test) | -- | 38 |
| Reading Comprehension | Belebele | -- | 20 |
| Machine Translation | FLORES en→xx | -- | 16 |
| Topic Classification | SIB200 | -- | 8 |
| Multitask Language Understanding | GlobalMMLU | -- | 6 |