EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training
About
Large language models (LLMs) are predominantly trained on English-centric data, resulting in uneven performance for smaller languages. We study whether continued pretraining (CPT) can substantially improve Estonian capabilities in a pretrained multilingual LLM while preserving its English and general reasoning performance. Using Llama 3.1 8B as the main base model, we perform CPT on a mixture that increases Estonian exposure while approximating the original training distribution through English replay and the inclusion of code, mathematics, and instruction-like data. We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior. Evaluation on a comprehensive suite of Estonian benchmarks shows consistent gains in linguistic competence, knowledge, reasoning, translation quality, and instruction-following compared to the original base model and its instruction-tuned variant, while maintaining competitive performance on English benchmarks. These findings indicate that CPT, with an appropriately balanced data mixture, together with post-training alignment, can substantially improve single-language capabilities in pretrained multilingual LLMs.
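The chat vector merging step mentioned above can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so this assumes the common formulation: the chat vector is the parameter-wise difference between an instruction-tuned model and its base model, and it is added to the continued-pretrained weights. The names `chat_vector`, `merge_chat_vector`, and the `scale` parameter are hypothetical, and plain Python lists stand in for real weight tensors.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def chat_vector(instruct: Weights, base: Weights) -> Weights:
    """Parameter-wise difference (instruct - base), assumed to capture instruction tuning."""
    return {
        name: [i - b for i, b in zip(instruct[name], base[name])]
        for name in base
    }


def merge_chat_vector(cpt: Weights, vector: Weights, scale: float = 1.0) -> Weights:
    """Add the (optionally scaled) chat vector to continued-pretrained weights."""
    return {
        name: [c + scale * v for c, v in zip(cpt[name], vector[name])]
        for name in cpt
    }


# Illustrative values only; these are not real model weights.
base = {"layer.weight": [1.0, 2.0, 3.0]}
instruct = {"layer.weight": [1.5, 2.0, 2.0]}
cpt = {"layer.weight": [2.0, 4.0, 6.0]}

merged = merge_chat_vector(cpt, chat_vector(instruct, base))
print(merged)
```

With real checkpoints the same arithmetic would be applied tensor by tensor, and it only makes sense when all three models share an architecture and parameter names.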
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | WinoGrande | Accuracy | 76.6 | 1085 |
| Commonsense Reasoning | WinoGrande | Accuracy | 61.2 | 372 |
| Truthfulness Evaluation | TruthfulQA | Accuracy | 36.6 | 103 |
| Common Sense Reasoning | PIQA | Accuracy | 76 | 71 |
| General Knowledge | MMLU-Redux | Accuracy | 66.1 | 30 |
| Chatbot Evaluation | AI Barometer Estonian Chatbot Arena 19.02.2026 | Score | 1.38e+3 | 20 |
| Question Answering | Belebele English | Accuracy | 87 | 18 |
| Instruction Following | IFEval EN | Score | 81.7 | 12 |
| Academic Question Answering | National Exam Estonian | Accuracy | 63.3 | 10 |
| Commonsense Reasoning | Winogrande Estonian | Accuracy | 64.4 | 10 |