PCMind-2.1-Kaiyuan-2B Technical Report
About
The rapid advancement of Large Language Models (LLMs) has resulted in a significant knowledge gap between the open-source community and industry, primarily because the latter relies on closed-source, high-quality data and training recipes. To address this, we introduce PCMind-2.1-Kaiyuan-2B, a fully open-source 2-billion-parameter model focused on improving training efficiency and effectiveness under resource constraints. Our methodology includes three key innovations: a Quantile Data Benchmarking method for systematically comparing heterogeneous open-source datasets and providing insights into data mixing strategies; a Strategic Selective Repetition scheme within a multi-phase paradigm to effectively leverage sparse, high-quality data; and a Multi-Domain Curriculum Training policy that orders samples by quality. Supported by a highly optimized data preprocessing pipeline and architectural modifications for FP16 stability, Kaiyuan-2B achieves performance competitive with state-of-the-art fully open-source models, demonstrating practical and scalable solutions for resource-limited pretraining. We release all assets (including model weights, data, and code) under the Apache 2.0 license at https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B.
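The report does not detail the curriculum policy here, but the core idea of ordering samples by quality while still mixing domains can be illustrated with a minimal sketch. All names, the round-robin interleaving, and the scoring scheme below are assumptions for illustration, not the authors' actual method:

```python
# Hypothetical sketch of quality-ordered multi-domain curriculum training.
# Assumes each sample already carries a scalar quality score; the
# interleaving strategy is an illustrative choice, not the paper's recipe.
from typing import Dict, List, Tuple

def curriculum_order(
    samples_by_domain: Dict[str, List[Tuple[float, str]]]
) -> List[str]:
    """Sort each domain's samples by ascending quality, then interleave
    domains round-robin so every stretch of training still mixes domains
    while average sample quality rises over the run."""
    # Sort each domain's (score, text) pairs from low to high quality.
    sorted_domains = {
        d: sorted(items, key=lambda st: st[0])
        for d, items in samples_by_domain.items()
    }
    iters = {d: iter(items) for d, items in sorted_domains.items()}
    ordered: List[str] = []
    # Round-robin across domains until every domain is exhausted.
    while iters:
        for d in list(iters):
            try:
                ordered.append(next(iters[d])[1])
            except StopIteration:
                del iters[d]
    return ordered

demo = {
    "web":  [(0.9, "web-hi"), (0.2, "web-lo")],
    "code": [(0.5, "code-mid"), (0.1, "code-lo")],
}
print(curriculum_order(demo))
# → ['web-lo', 'code-lo', 'web-hi', 'code-mid']
```

The low-quality samples from every domain surface first and the highest-quality ones last, which matches the "orders samples by quality" description at the coarsest level; a real pipeline would operate on shards rather than individual samples.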
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval | -- | -- | 850 |
| Mathematical Reasoning | MATH | Accuracy | 30.34 | 535 |
| Commonsense Reasoning | CSQA | Accuracy | 67.4 | 366 |
| Physical Commonsense Reasoning | PIQA | Accuracy | 74.37 | 329 |
| Reading Comprehension | BoolQ | Accuracy | 78.53 | 219 |
| General Knowledge | MMLU | Accuracy | 53.9 | 170 |
| Mathematical Reasoning | MATH | -- | -- | 162 |
| Science Question Answering | ARC-E | Accuracy | 82.89 | 138 |
| Science Question Answering | ARC-C | Accuracy | 66.1 | 127 |
| Mathematical Reasoning | GSM8K | Accuracy | 51.33 | 57 |