From Curated Data to Scalable Models: Continual Pre-training of Dense and MoE Large Language Models for Tibetan
About
Large language models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks, yet their performance remains heavily biased toward high-resource languages. Tibetan, despite its cultural significance and large speaker population, is still substantially underrepresented. In this work, we present a comprehensive pipeline for advancing Tibetan language modeling through large-scale data curation and continual pre-training. We construct a 72 GB high-quality Tibetan corpus, the largest to date, and adapt Qwen2.5-7B through balanced multilingual continual pre-training on Tibetan, Chinese, and English, followed by multilingual instruction tuning. To further scale capacity efficiently, we extend the dense model to a 50B-A10B Mixture-of-Experts (MoE) architecture. Because no standardized Tibetan benchmarks exist, we build multiple evaluation datasets via high-quality translation and human verification. Experimental results show that both the dense and MoE models consistently outperform existing open-source and Tibetan-focused models of similar scale across diverse tasks. Our work advances Tibetan-centric LLM research and provides transferable insights for extending LLMs to other low-resource languages. We will release the model weights, evaluation benchmarks, and detailed data processing documentation in a follow-up release.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Story Reasoning | XStoryCloze | Accuracy | 61.86 | 51 |
| Multiple-Choice Reasoning | Hellaswag bo | Accuracy | 39.16 | 17 |
| Reasoning | Xcope bo | Accuracy | 65.4 | 17 |
| Reading Comprehension | TibetanQA | Exact Match (EM) | 59.19 | 11 |
| Reasoning and Knowledge Assessment | Arc-bo | Accuracy | 53.67 | 11 |
| Reasoning and Knowledge Assessment | Xstorycloze bo | Accuracy | 72.96 | 11 |
| Machine Translation | Flores-200 zh-bo | BLEU | 21.56 | 6 |
| Machine Translation | Flores-200 en-bo | BLEU | 13.56 | 6 |
| Multiple-Choice Reasoning | Arc-bo | Accuracy | 48.39 | 6 |
| Question Answering | TibetanQA | Exact Match (EM) | 49.43 | 6 |
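
For readers unfamiliar with the Exact Match (EM) metric reported for TibetanQA, the following is a minimal sketch of how such a score is commonly computed. It is not the authors' evaluation code: the function names (`normalize`, `exact_match`, `em_score`) are hypothetical, and the normalization steps (Unicode NFC plus whitespace collapsing) are an assumption, since the source does not describe how answers are normalized before comparison.

```python
import unicodedata


def normalize(text: str) -> str:
    """Assumed normalization: Unicode NFC, then collapse runs of whitespace.

    The Tibetan tsheg (U+0F0B) is not whitespace, so syllable
    delimiters are left untouched.
    """
    return " ".join(unicodedata.normalize("NFC", text).split())


def exact_match(prediction: str, references: list[str]) -> bool:
    """True if the normalized prediction equals any normalized reference."""
    pred = normalize(prediction)
    return any(pred == normalize(ref) for ref in references)


def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Percentage of predictions that exactly match one of their references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

Under this definition, a prediction earns credit only when it matches a reference answer in full, which makes EM a stricter measure than the accuracy and BLEU figures in the table above.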