PCMind-2.1-Kaiyuan-2B Technical Report
About
The rapid advancement of Large Language Models (LLMs) has resulted in a significant knowledge gap between the open-source community and industry, primarily because the latter relies on closed-source, high-quality data and training recipes. To address this, we introduce PCMind-2.1-Kaiyuan-2B, a fully open-source 2-billion-parameter model focused on improving training efficiency and effectiveness under resource constraints. Our methodology includes three key innovations: a Quantile Data Benchmarking method for systematically comparing heterogeneous open-source datasets and providing insights into data mixing strategies; a Strategic Selective Repetition scheme within a multi-phase paradigm to effectively leverage sparse, high-quality data; and a Multi-Domain Curriculum Training policy that orders samples by quality. Supported by a highly optimized data preprocessing pipeline and architectural modifications for FP16 stability, Kaiyuan-2B achieves performance competitive with state-of-the-art fully open-source models, demonstrating practical and scalable solutions for resource-limited pretraining. We release all assets (including model weights, data, and code) under the Apache 2.0 license at https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B.
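The report does not detail the curriculum policy here, but the core idea of ordering samples by quality while still mixing domains can be illustrated with a minimal sketch. All names, the round-robin interleaving, and the scoring scheme below are assumptions for illustration, not the authors' actual method:

```python
# Hypothetical sketch of quality-ordered multi-domain curriculum training.
# Assumes each sample already carries a scalar quality score; the
# interleaving strategy is an illustrative choice, not the paper's recipe.
from typing import Dict, List, Tuple

def curriculum_order(
    samples_by_domain: Dict[str, List[Tuple[float, str]]]
) -> List[str]:
    """Sort each domain's samples by ascending quality, then interleave
    domains round-robin so every stretch of training still mixes domains
    while average sample quality rises over the run."""
    # Sort each domain's (score, text) pairs from low to high quality.
    sorted_domains = {
        d: sorted(items, key=lambda st: st[0])
        for d, items in samples_by_domain.items()
    }
    iters = {d: iter(items) for d, items in sorted_domains.items()}
    ordered: List[str] = []
    # Round-robin across domains until every domain is exhausted.
    while iters:
        for d in list(iters):
            try:
                ordered.append(next(iters[d])[1])
            except StopIteration:
                del iters[d]
    return ordered

demo = {
    "web":  [(0.9, "web-hi"), (0.2, "web-lo")],
    "code": [(0.5, "code-mid"), (0.1, "code-lo")],
}
print(curriculum_order(demo))
# → ['web-lo', 'code-lo', 'web-hi', 'code-mid']
```

The low-quality samples from every domain surface first and the highest-quality ones last, which matches the "orders samples by quality" description at the coarsest level; a real pipeline would operate on shards rather than individual samples.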
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval | -- | -- | 850 |
| Mathematical Reasoning | MATH | Accuracy | 30.34 | 535 |
| Commonsense Reasoning | CSQA | Accuracy | 67.4 | 366 |
| Physical Commonsense Reasoning | PIQA | Accuracy | 74.37 | 329 |
| Reading Comprehension | BoolQ | Accuracy | 78.53 | 219 |
| General Knowledge | MMLU | Accuracy | 53.9 | 170 |
| Mathematical Reasoning | MATH | -- | -- | 162 |
| Science Question Answering | ARC-E | Accuracy | 82.89 | 138 |
| Science Question Answering | ARC-C | Accuracy | 66.1 | 127 |
| Mathematical Reasoning | GSM8K | Accuracy | 51.33 | 57 |