
MAmmoTH2: Scaling Instructions from the Web

About

Instruction tuning improves the reasoning abilities of large language models (LLMs), with data quality and scalability being the crucial factors. Most instruction tuning data come from human crowd-sourcing or GPT-4 distillation. We propose a paradigm to efficiently harvest 10 million naturally existing instruction data from the pre-training web corpus to enhance LLM reasoning. Our approach involves (1) recalling relevant documents, (2) extracting instruction-response pairs, and (3) refining the extracted pairs using open-source LLMs. Fine-tuning base LLMs on this dataset, we build MAmmoTH2 models, which significantly boost performance on reasoning benchmarks. Notably, MAmmoTH2-7B (Mistral) improves from 11% to 36.7% on MATH and from 36% to 68.4% on GSM8K without training on any in-domain data. Further training MAmmoTH2 on public instruction tuning datasets yields MAmmoTH2-Plus, achieving state-of-the-art performance on several reasoning and chatbot benchmarks. Our work demonstrates how to harvest large-scale, high-quality instruction data without costly human annotation or GPT-4 distillation, providing a new paradigm for building better instruction tuning data.
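The three-step pipeline above (recall, extract, refine) can be sketched as follows. This is a minimal illustrative toy, not the authors' implementation: the function names, the keyword-based recall, and the `Q:/A:` regex extraction are all placeholder heuristics standing in for the trained recall classifier and the open-source LLM extraction/refinement steps described in the abstract.

```python
import re

def recall_documents(corpus, keywords):
    """Step 1: recall documents likely to contain instruction-like
    content. A real system would use a trained classifier; here we
    just keyword-filter as a stand-in."""
    return [doc for doc in corpus
            if any(kw in doc.lower() for kw in keywords)]

def extract_qa_pairs(doc):
    """Step 2: extract candidate instruction-response pairs.
    Placeholder: a naive regex for 'Q: ... A: ...' patterns, where
    the paper instead prompts an LLM to extract pairs."""
    return re.findall(r"Q:\s*(.+?)\s*A:\s*(.+?)(?=Q:|$)", doc, re.S)

def refine_pair(question, answer):
    """Step 3: refine the pair. Placeholder cleanup; the real
    pipeline prompts an open-source LLM to fix formatting and
    fill in missing reasoning steps."""
    return {"instruction": question.strip(), "response": answer.strip()}

corpus = [
    "Q: What is 2 + 3? A: 2 + 3 = 5. Q: Simplify 4/8. A: 4/8 = 1/2.",
    "An unrelated news article about the weather.",
]
recalled = recall_documents(corpus, keywords=["q:", "a:"])
dataset = [refine_pair(q, a)
           for doc in recalled
           for q, a in extract_qa_pairs(doc)]
print(len(dataset))  # prints 2: two harvested instruction-response pairs
```

At web scale, each placeholder is replaced by a model: recall uses a fast document classifier over the pre-training corpus, and the extraction and refinement stages are LLM calls, which is what makes the approach cheaper than GPT-4 distillation while remaining scalable.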

Xiang Yue, Tuney Zheng, Ge Zhang, Wenhu Chen • 2024

Related benchmarks

Task                               Dataset          Metric             Result   Rank
Mathematical Reasoning             GSM8K            Accuracy           86.4     1362
Code Generation                    HumanEval        Pass@1             17.68    1036
Mathematical Reasoning             MATH             Accuracy           47       882
Multi-task Language Understanding  MMLU             Accuracy           68.3     876
Language Understanding             MMLU             Accuracy           64.89    825
Mathematical Reasoning             MATH             Accuracy           34.1     535
Instruction Following              AlpacaEval 2.0   Win Rate           33.8     507
Mathematical Reasoning             MATH (test)      Overall Accuracy   47       433
Mathematical Reasoning             SVAMP            Accuracy           90.3     403
Mathematical Reasoning             MATH             Accuracy           36.7     338

Showing 10 of 35 rows.
