WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

About

Large language models (LLMs), such as GPT-4, have shown remarkable performance in natural language processing (NLP) tasks, including challenging mathematical reasoning. However, most existing open-source models are only pre-trained on large-scale internet data, without math-related optimization. In this paper, we present WizardMath, which enhances the mathematical chain-of-thought (CoT) reasoning abilities of LLMs without using external Python tools, by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math. Through extensive experiments on two mathematical reasoning benchmarks, GSM8K and MATH, we reveal the extraordinary capabilities of our model. Remarkably, WizardMath-Mistral 7B surpasses top-tier open-source LLMs by a substantial margin with higher data efficiency. Furthermore, WizardMath 70B even outperforms GPT-3.5-Turbo, Claude 2, Gemini Pro, and GPT-4-early-version. Additionally, our preliminary exploration highlights the pivotal role of instruction evolution and process supervision in achieving exceptional math performance. For more details, refer to https://github.com/nlpxucan/WizardLM

Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, Yansong Tang, Dongmei Zhang • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy | 81.6 | 983 |
| Mathematical Reasoning | GSM8K (test) | Accuracy | 88.5 | 751 |
| Mathematical Reasoning | MATH | Accuracy | 33.3 | 643 |
| Mathematical Reasoning | MATH | Accuracy | 33 | 535 |
| Mathematical Reasoning | MATH (test) | Overall Accuracy | 33 | 433 |
| Mathematical Reasoning | MATH500 (test) | Accuracy | 77.4 | 381 |
| Mathematical Reasoning | SVAMP | Accuracy | 71.8 | 368 |
| Mathematical Reasoning | GSM8K | Accuracy (GSM8K) | 81.6 | 358 |
| Mathematical Reasoning | SVAMP (test) | Accuracy | 64.3 | 233 |
| Mathematical Reasoning | CollegeMATH | Accuracy | 23.1 | 161 |
Showing 10 of 53 rows
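
The accuracy figures above are percentages of problems answered correctly. As a rough illustration only, the sketch below scores model outputs by comparing the last number in each output against the last number in the reference answer (GSM8K reference solutions end with a line of the form `#### <number>`). The helper names `extract_final_number` and `accuracy` are hypothetical, and this is not the evaluation script used by the WizardMath authors or by this site; real harnesses typically apply more careful answer normalization, especially for MATH.

```python
import re


def extract_final_number(text):
    """Return the last number found in `text` as a float, or None if there is none.

    Assumption: the final answer is the last number in the string, which holds
    for GSM8K-style references ("#### 18") and for many chain-of-thought outputs.
    """
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    # Drop thousands separators such as "1,200" before converting.
    return float(matches[-1].replace(",", ""))


def accuracy(predictions, references):
    """Percentage of predictions whose final number matches the reference's."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    correct = 0
    for pred, ref in zip(predictions, references):
        p = extract_final_number(pred)
        r = extract_final_number(ref)
        if p is not None and r is not None and abs(p - r) < 1e-6:
            correct += 1
    return 100.0 * correct / len(references)


if __name__ == "__main__":
    # Toy, made-up examples purely to exercise the functions.
    preds = ["So the answer is 18.", "Total: 1,200 dollars", "I am not sure"]
    refs = ["#### 18", "#### 1200", "#### 5"]
    print(f"accuracy = {accuracy(preds, refs):.1f}%")
```

A last-number heuristic like this can miscount when an output mentions further numbers after its answer, which is one reason reported scores depend on the exact evaluation protocol.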
