EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training

About

Large language models (LLMs) are predominantly trained on English-centric data, resulting in uneven performance for smaller languages. We study whether continued pretraining (CPT) can substantially improve Estonian capabilities in a pretrained multilingual LLM while preserving its English and general reasoning performance. Using Llama 3.1 8B as the main base model, we perform CPT on a mixture that increases Estonian exposure while approximating the original training distribution through English replay and the inclusion of code, mathematics, and instruction-like data. We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior. Evaluation on a comprehensive suite of Estonian benchmarks shows consistent gains in linguistic competence, knowledge, reasoning, translation quality, and instruction-following compared to the original base model and its instruction-tuned variant, while maintaining competitive performance on English benchmarks. These findings indicate that CPT, with an appropriately balanced data mixture, together with post-training alignment, can substantially improve single-language capabilities in pretrained multilingual LLMs.

Aleksei Dorkin, Taido Purason, Emil Kalbaliyev, Hele-Andra Kuulmets, Marii Ojastu, Mark Fišel, Tanel Alumäe, Eleri Aedmaa, Krister Kruusmaa, Kairit Sirts • 2026
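
The abstract names chat vector merging as one of the post-training steps but does not spell out the recipe. As commonly described, a chat vector is the per-parameter difference between an instruction-tuned model and its base model, which is then added to the weights of a continued-pretrained model. The sketch below illustrates only that arithmetic on toy values. The function names, the `scale` parameter, and the use of plain lists instead of real tensors are assumptions made for illustration, not details taken from the paper.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def chat_vector(base: Weights, instruct: Weights) -> Weights:
    """Per-parameter difference (instruct - base) for parameters present in both."""
    return {
        name: [i - b for i, b in zip(instruct[name], values)]
        for name, values in base.items()
        if name in instruct and len(instruct[name]) == len(values)
    }


def merge_chat_vector(cpt: Weights, vector: Weights, scale: float = 1.0) -> Weights:
    """Add the (optionally scaled) chat vector to continued-pretrained weights.

    Parameters without a matching entry in the vector are copied unchanged.
    """
    merged: Weights = {}
    for name, values in cpt.items():
        delta = vector.get(name)
        if delta is None or len(delta) != len(values):
            merged[name] = list(values)
        else:
            merged[name] = [v + scale * d for v, d in zip(values, delta)]
    return merged


# Hypothetical toy weights, chosen only to make the arithmetic visible.
base = {"layer.weight": [0.10, -0.20, 0.30]}
instruct = {"layer.weight": [0.15, -0.25, 0.30]}
cpt = {"layer.weight": [0.40, -0.10, 0.20], "new.bias": [1.0]}

merged = merge_chat_vector(cpt, chat_vector(base, instruct))
```

With real checkpoints the same subtraction and addition would run over tensors, and the scale would normally be tuned on held-out data.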

Related benchmarks

Task                         Dataset                                         Result            Rank
Commonsense Reasoning        WinoGrande                                      Accuracy 76.6     1085
Commonsense Reasoning        WinoGrande                                      Accuracy 61.2     372
Truthfulness Evaluation      TruthfulQA                                      Accuracy 36.6     103
Common Sense Reasoning       PIQA                                            Accuracy 76       71
General Knowledge            MMLU-Redux                                      Accuracy 66.1     30
Chatbot Evaluation           AI Barometer Estonian Chatbot Arena 19.02.2026  Score 1.38e+3     20
Question Answering           Belebele English                                Accuracy 87       18
Instruction Following        IFEval EN                                       Score 81.7        12
Academic Question Answering  National Exam Estonian                          Accuracy 63.3     10
Commonsense Reasoning        Winogrande Estonian                             Accuracy 64.4     10
Showing 10 of 19 rows
