s1: Simple test-time scaling

About

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1
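The budget forcing procedure described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: `step` is a hypothetical stand-in for a model's decoding call, `END_OF_THINKING` is an assumed delimiter (the real one depends on the model's chat template), and whitespace-separated words stand in for tokens. The sketch caps the thinking trace at `max_tokens` (forced termination) and, whenever the model tries to stop early, suppresses the stop and appends "Wait" up to `min_extensions` times (forced lengthening).

```python
from typing import Callable, List

# Assumed end-of-thinking delimiter; the actual string depends on the model.
END_OF_THINKING = "</think>"


def budget_forced_thinking(
    step: Callable[[str, int], str],
    prompt: str,
    max_tokens: int,
    min_extensions: int = 0,
) -> str:
    """Return a thinking trace whose length is controlled by budget forcing.

    `step(context, budget)` is a hypothetical decoding function. It returns
    whitespace-separated tokens that continue `context` and may end with
    END_OF_THINKING when the model wants to stop thinking.
    """
    trace: List[str] = []
    extensions_left = min_extensions
    while len(trace) < max_tokens:
        budget = max_tokens - len(trace)
        context = (prompt + " " + " ".join(trace)).strip()
        chunk = step(context, budget).split()[:budget]
        stopped = bool(chunk) and chunk[-1] == END_OF_THINKING
        if stopped:
            chunk = chunk[:-1]  # drop the delimiter; we decide when to end
        trace.extend(chunk)
        if not stopped or extensions_left == 0:
            break  # budget used up, nothing generated, or no extensions left
        # Lengthen: suppress the stop and nudge the model to keep reasoning.
        extensions_left -= 1
        if len(trace) < max_tokens:
            trace.append("Wait")
    # Terminate: always close the thinking section ourselves.
    return " ".join(trace + [END_OF_THINKING])


def toy_step(context: str, budget: int) -> str:
    """Toy stand-in for a model: emits two tokens, then tries to stop."""
    return "step step " + END_OF_THINKING


if __name__ == "__main__":
    print(budget_forced_thinking(toy_step, "Q:", max_tokens=100, min_extensions=2))
```

With the toy model, raising `min_extensions` lengthens the trace by one "Wait" plus one more round of reasoning per extension, while lowering `max_tokens` truncates the trace regardless of what the model wants to do. In a real setup, `step` would wrap an actual generation call and the answer would be decoded after the returned trace.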

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto · 2025

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Mathematical Reasoning	MATH500 (test)	Accuracy	93	895
Reasoning	BBH	Accuracy	67.4	726
Mathematical Reasoning	AIME 2024	Accuracy	78.8	370
Mathematical Reasoning	AIME 24	Accuracy	73.3	318
Mathematical Reasoning	Minerva	Pass@1 Accuracy	26.1	289
Multitask Language Understanding	MMLU	Accuracy	73.2	263
Visual Mathematical Reasoning	MathVision	Accuracy	54.3	254
Science Reasoning	GPQA	Accuracy	63.6	243
Reasoning	MMLU-Pro	Accuracy	28.7	241
Mathematical Reasoning	MATH 500	Pass@1 Rate	72.6	236

Showing 10 of 133 rows
