WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

About

Large language models (LLMs), such as GPT-4, have shown remarkable performance in natural language processing (NLP) tasks, including challenging mathematical reasoning. However, most existing open-source models are pre-trained only on large-scale internet data, without math-related optimization. In this paper, we present WizardMath, which enhances the mathematical chain-of-thought (CoT) reasoning abilities of LLMs without using external Python tools, by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math. Through extensive experiments on two mathematical reasoning benchmarks, namely GSM8k and MATH, we reveal the extraordinary capabilities of our model. Remarkably, WizardMath-Mistral 7B surpasses top-tier open-source LLMs by a substantial margin with higher data efficiency. Furthermore, WizardMath 70B even outperforms GPT-3.5-Turbo, Claude 2, Gemini Pro, and GPT-4-early-version. Additionally, our preliminary exploration highlights the pivotal role of instruction evolution and process supervision in achieving exceptional math performance. For more details, refer to https://github.com/nlpxucan/WizardLM

Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, Yansong Tang, Dongmei Zhang • 2023

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	Accuracy 81.6	1424
Mathematical Reasoning	MATH500 (test)	Accuracy 77.4	922
Mathematical Reasoning	MATH	Accuracy 33.3	882
Mathematical Reasoning	GSM8K (test)	Accuracy 88.5	816
Mathematical Reasoning	MATH	Accuracy 33	535
Mathematical Reasoning	MATH (test)	Overall Accuracy 33	433
Mathematical Reasoning	SVAMP	Accuracy 71.8	403
Mathematical Reasoning	GSM8K	Accuracy (GSM8K) 81.6	358
Mathematical Reasoning	CollegeMATH	Accuracy 23.1	337
Mathematical Reasoning	SVAMP (test)	Accuracy 64.3	298

Showing 10 of 66 rows
