OLMoE: Open Mixture-of-Experts Language Models

About

We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
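The gap between total and active parameters comes from sparse routing: each token is sent to only a few experts, so most expert weights are not used for that token. Below is a minimal sketch of top-k routing for a single token, in plain Python. The router shape, the expert functions, the toy sizes (8 experts, 2 active), and the choice not to renormalize the selected weights are illustrative assumptions, not details taken from this summary or from the OLMoE implementation.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k most probable experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in top]


def moe_forward(x, router_weights, experts, k):
    """Sparse MoE layer for one token vector x (hypothetical toy layer).

    router_weights: one weight vector per expert (assumed linear router).
    experts: list of callables mapping a vector to a vector of the same size.
    Only the k selected experts are evaluated; the rest are skipped.
    """
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in router_weights]
    out = [0.0] * len(x)
    for idx, weight in route_top_k(logits, k):
        y = experts[idx](x)
        out = [o + weight * y_i for o, y_i in zip(out, y)]
    return out


def make_expert(scale):
    """Toy expert: scales its input. A real expert would be a feed-forward block."""
    return lambda v: [scale * v_i for v_i in v]


# Toy usage with assumed sizes: 8 experts, 2 active per token.
random.seed(0)
dim, num_experts, k = 4, 8, 2
router_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
experts = [make_expert(float(i + 1)) for i in range(num_experts)]
token = [0.5, -1.0, 0.25, 2.0]
output = moe_forward(token, router_weights, experts, k)
```

In this toy setup only 2 of the 8 experts run per token, which is the same mechanism that lets a model hold far more parameters than it uses for any single input.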

Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, Ali Farhadi, Noah A. Smith, Pang Wei Koh, Amanpreet Singh, Hannaneh Hajishirzi • 2024

Related benchmarks

Task	Dataset	Result
Commonsense Reasoning	WinoGrande	Accuracy 68.4	1442
Question Answering	ARC Challenge	Accuracy 55.2	906
Multi-task Language Understanding	MMLU	Accuracy 50.5	881
Commonsense Reasoning	PIQA	Accuracy 80.7	757
Language Modeling	C4 (val)	PPL 39.852	737
Commonsense Reasoning	HellaSwag	HellaSwag Accuracy 77	711
Code Generation	HumanEval (test)	--	612
Question Answering	ARC Challenge	Accuracy (ARC) 48.5	598
Question Answering	ARC Easy	Accuracy 76.89	597
Natural Language Inference	RTE	Accuracy 71.84	590

Showing 10 of 84 rows
