Nanbeige4-3B Technical Report: Exploring the Frontier of Small Language Models
About
We present Nanbeige4-3B, a family of small-scale but high-performing language models. Pretrained on 23T high-quality tokens and finetuned on over 30 million diverse instructions, Nanbeige4-3B extends the boundary of the scaling law for small language models. In pre-training, we design a Fine-Grained Warmup-Stable-Decay (FG-WSD) training scheduler, which progressively refines data mixtures across stages to boost model performance. In post-training, to improve the quality of the SFT data, we design a joint mechanism that integrates deliberative generation refinement and chain-of-thought reconstruction, yielding substantial gains on complex tasks. Following SFT, we use our flagship reasoning model to distill Nanbeige4-3B through our proposed Dual Preference Distillation (DPD) method, which leads to further performance gains. Finally, we apply a multi-stage reinforcement learning phase that leverages verifiable rewards and preference modeling to strengthen both reasoning and human alignment. Extensive evaluations show that Nanbeige4-3B not only significantly outperforms models of comparable parameter scale but also rivals much larger models across a wide range of benchmarks. The model checkpoints are available at https://huggingface.co/Nanbeige.
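For readers unfamiliar with the Warmup-Stable-Decay shape that FG-WSD builds on, the following is a minimal sketch of a generic WSD learning-rate schedule. It is not the report's scheduler: the function name `wsd_lr`, the linear warmup and linear decay, and all default values are assumptions chosen for illustration, and the fine-grained data-mixture refinement across stages described above is not modeled here.

```python
def wsd_lr(step, total_steps, warmup_steps=10, decay_steps=100,
           peak_lr=3e-4, min_lr=0.0):
    """Generic Warmup-Stable-Decay learning rate (illustrative only).

    All parameter names and defaults are hypothetical; the report's
    FG-WSD scheduler additionally changes data mixtures per stage.
    """
    stable_end = total_steps - decay_steps
    if step < warmup_steps:
        # Linear warmup from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < stable_end:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay phase: linear interpolation from peak_lr down to min_lr.
    progress = min((step - stable_end) / max(decay_steps, 1), 1.0)
    return peak_lr + (min_lr - peak_lr) * progress
```

In a training loop, a function like this would typically be evaluated once per optimizer step and the result written into the optimizer's learning rate.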
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Scientific Reasoning | GPQA | Accuracy | 82.2 | 55 |
| Deep search | HLE text only | Score | 13.89 | 14 |
| Deep search | xBench DeepSearch (05) | Score | 33 | 14 |
| Deep search | GAIA Text-Only | Score | 0.1942 | 14 |
| Deep search | Browse Comp | Score | 0.79 | 14 |
| Deep search | Browse Comp ZH | Score | 3.09 | 14 |
| Deep search | SEAL 0 | Score | 12.61 | 11 |
| Preference Modeling | Arena-Hard v2 | Win Rate | 60 | 9 |
| Writing capability evaluation | WritingBench November 2025 (official leaderboard) | Overall Score | 79.03 | 9 |
| Deep search | xBench DeepSearch-10 | Score | 11 | 8 |