YuLan-Mini: An Open Data-efficient Language Model

About

Effective pre-training of large language models (LLMs) has been challenging due to the immense resource demands and the complexity of the technical processes involved. This paper presents a detailed technical report on YuLan-Mini, a highly capable base model with 2.42B parameters that achieves top-tier performance among models of similar parameter scale. Our pre-training approach focuses on enhancing training efficacy through three key technical contributions: an elaborate data pipeline that combines data cleaning with data scheduling strategies, a robust optimization method to mitigate training instability, and an effective annealing approach that incorporates targeted data selection and long-context training. Remarkably, YuLan-Mini, trained on 1.08T tokens, achieves performance comparable to industry-leading models that require significantly more data. To facilitate reproduction, we release the full details of the data composition for each training phase. Project details can be accessed at the following link: https://github.com/RUC-GSAI/YuLan-Mini.

Yiwen Hu, Huatong Song, Jia Deng, Jiapeng Wang, Jie Chen, Kun Zhou, Yutao Zhu, Jinhao Jiang, Zican Dong, Wayne Xin Zhao, Ji-Rong Wen • 2024

Related benchmarks

Task	Dataset	Result	
Commonsense Reasoning	WinoGrande	--	1581
Commonsense Reasoning	HellaSwag	HellaSwag Accuracy: 68.56	897
Physical Commonsense Reasoning	PIQA	Accuracy: 76.22	724
Commonsense Reasoning	PIQA	Accuracy: 57.4	400
Common Sense Reasoning	COPA	Accuracy: 65	288
Math Reasoning	GSM8K	Accuracy (GSM8K): 1.5	190
Mathematical Reasoning	GSM-PLUS	Accuracy: 43.71	162
Code Generation	EvalPlus	Pass@1: 62.25	118
Code Reasoning	HumanEval	--	70
STEM Knowledge	MMLU STEM	Accuracy: 44.12	43

Showing 10 of 18 rows
