EuroLLM: Multilingual Language Models for Europe

About

The quality of open-weight LLMs has improved significantly, yet they remain predominantly focused on English. In this paper, we introduce the EuroLLM project, aimed at developing a suite of open-weight multilingual LLMs capable of understanding and generating text in all official European Union languages, as well as several additional relevant languages. We outline the progress made to date, detailing our data collection and filtering process, the development of scaling laws, the creation of our multilingual tokenizer, and the data mix and modeling configurations. Additionally, we release our initial models, EuroLLM-1.7B and EuroLLM-1.7B-Instruct, and report their performance on multilingual general benchmarks and machine translation.

Pedro Henrique Martins, Patrick Fernandes, João Alves, Nuno M. Guerreiro, Ricardo Rei, Duarte M. Alves, José Pombal, Amin Farajian, Manuel Faysse, Mateusz Klimaszewski, Pierre Colombo, Barry Haddow, José G. C. de Souza, Alexandra Birch, André F. T. Martins • 2024

Related benchmarks

Task	Dataset	Result
Multi-task Language Understanding	MMLU	Accuracy 28.3	881
Instruction Following	IFEval	--	854
Commonsense Reasoning	WinoGrande	Accuracy 57.8	453
Science Question Answering	ARC Challenge	Accuracy 35.9	354
Multitask Language Understanding	MMLU-Pro	Accuracy 10.9	303
Question Answering	ARC-C	Accuracy 31.57	283
Common Sense Reasoning	HellaSwag	Accuracy 45.9	213
Commonsense Reasoning	SocialIQA	Accuracy 44.8	164
Science Question Answering	ARC Easy	Accuracy 71.3	162
Emotional Intelligence	Polish EQ-Bench	Overall Score 54.1	106

Showing 10 of 112 rows
