
PCMind-2.1-Kaiyuan-2B Technical Report

About

The rapid advancement of Large Language Models (LLMs) has resulted in a significant knowledge gap between the open-source community and industry, primarily because the latter relies on closed-source, high-quality data and training recipes. To address this, we introduce PCMind-2.1-Kaiyuan-2B, a fully open-source 2-billion-parameter model focused on improving training efficiency and effectiveness under resource constraints. Our methodology includes three key innovations: a Quantile Data Benchmarking method for systematically comparing heterogeneous open-source datasets and providing insights on data mixing strategies; a Strategic Selective Repetition scheme within a multi-phase paradigm to effectively leverage sparse, high-quality data; and a Multi-Domain Curriculum Training policy that orders samples by quality. Supported by a highly optimized data preprocessing pipeline and architectural modifications for FP16 stability, Kaiyuan-2B achieves performance competitive with state-of-the-art fully open-source models, demonstrating practical and scalable solutions for resource-limited pretraining. We release all assets (including model weights, data, and code) under the Apache 2.0 license at https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B.

Kairong Luo, Zhenbo Sun, Xinyu Shi, Shengqi Chen, Bowen Yu, Yunyi Chen, Chenyi Dang, Hengtao Tao, Hui Wang, Fangming Liu, Kaifeng Lyu, Wenguang Chen • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Code Generation | HumanEval | -- | 850 |
| Mathematical Reasoning | MATH | Accuracy 30.34 | 535 |
| Commonsense Reasoning | CSQA | Accuracy 67.4 | 366 |
| Physical Commonsense Reasoning | PIQA | Accuracy 74.37 | 329 |
| Reading Comprehension | BoolQ | Accuracy 78.53 | 219 |
| General Knowledge | MMLU | MMLU General Knowledge Accuracy 53.9 | 170 |
| Mathematical Reasoning | MATH | -- | 162 |
| Science Question Answering | ARC-E | Accuracy 82.89 | 138 |
| Science Question Answering | ARC-C | Accuracy 66.1 | 127 |
| Mathematical Reasoning | GSM8K | Accuracy 51.33 | 57 |
Showing 10 of 22 rows
