NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions

About

Scaling reasoning capabilities beyond traditional domains such as math and coding is hindered by the lack of diverse and high-quality questions. To overcome this limitation, we introduce a scalable approach for generating diverse and challenging reasoning questions, accompanied by reference answers. We present NaturalReasoning, a comprehensive dataset comprising 2.8 million questions that span multiple domains, including STEM fields (e.g., Physics, Computer Science), Economics, Social Sciences, and more. We demonstrate the utility of the questions in NaturalReasoning through knowledge distillation experiments which show that NaturalReasoning can effectively elicit and transfer reasoning capabilities from a strong teacher model. Furthermore, we demonstrate that NaturalReasoning is also effective for unsupervised self-training using external reward models or self-rewarding. To foster future work, we publicly release NaturalReasoning at https://huggingface.co/datasets/facebook/natural_reasoning.
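As a rough sketch of how the released dataset might be inspected, the snippet below streams a few records using the Hugging Face `datasets` library. The dataset ID comes from the URL above. The split name `"train"` and the helper names `load_natural_reasoning` and `preview` are assumptions made for illustration and are not taken from the paper.

```python
import itertools

# Dataset ID taken from the release URL in the abstract.
DATASET_ID = "facebook/natural_reasoning"


def load_natural_reasoning(split="train", streaming=True):
    """Open the dataset from the Hugging Face Hub.

    Assumptions: the `datasets` package is installed and the dataset
    exposes a split called "train". Streaming avoids downloading all
    2.8M questions before looking at the first few.
    """
    from datasets import load_dataset  # imported lazily; needs network access

    return load_dataset(DATASET_ID, split=split, streaming=streaming)


def preview(records, n=3):
    """Return the first `n` items from any iterable of records."""
    return list(itertools.islice(records, n))


if __name__ == "__main__":
    # Print the raw records rather than assuming specific field names.
    for record in preview(load_natural_reasoning(), n=3):
        print(record)
```

Printing whole records first is a deliberate choice here: it shows which fields the release actually provides before any downstream code depends on particular column names.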

Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Ilia Kulikov, Kyunghyun Cho, Dong Wang, Yuandong Tian, Jason E Weston, Xian Li • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Math Reasoning | GSM8K | Accuracy (GSM8K): 81.5 | 131 |
| Knowledge Reasoning | MMLU-Pro | -- | 120 |
| Writing | WritingBench | Score: 36.77 | 74 |
| Language Understanding | MMLU-Pro | MMLU-Pro Accuracy: 47.96 | 60 |
| Physics Reasoning | Public Physics Benchmarks (GPQA, SciBench, PhysReason) (test) | GPQA Accuracy: 36.82 | 21 |
| Open-ended writing | WritingBench | Score: 36.77 | 20 |
| Instruction Following | IFEval | Score (%): 28.15 | 18 |
| Code Generation | MBPP+ | AVG Score: 61.15 | 17 |
| Code Generation | HumanEval+ | Score: 32.93 | 5 |
| Aggregate General Performance | ARES Evaluation Suite | Average Score: 45.91 | 5 |
Showing 10 of 12 rows
