MiniLingua: A Small Open-Source LLM for European Languages
About
Large language models are powerful but often limited by high computational cost, privacy concerns, and English-centric training. Recent progress demonstrates that small, efficient models with around one billion parameters can deliver strong results and enable on-device use. This paper introduces MiniLingua, a multilingual open-source LLM of one billion parameters trained from scratch for 13 European languages, designed to balance coverage and instruction-following capabilities. Based on evaluation results, the instruction-tuned version of MiniLingua outperforms EuroLLM, a model with a similar training approach but a larger training budget, on summarization, classification and both open- and closed-book question answering. Moreover, it remains competitive with more advanced state-of-the-art models on open-ended generation tasks. We release model weights, tokenizer and source code used for data processing and model training.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Machine Translation | Flores-200 (test) | -- | 22 |
| Reading Comprehension | Belebele | Accuracy 26.2 | 20 |
| Topic Classification | SIB200 | Accuracy 24.8 | 8 |
| Question Answering | MMLU-X | Accuracy 24.5 | 8 |
| Text Summarization | MassiveSum | Score 18.7 | 4 |
| Machine Translation | Flores-200 | COMET 0.343 | 4 |