WizardCoder: Empowering Code Large Language Models with Evol-Instruct
About
Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated exceptional performance in code-related tasks. However, most existing models are solely pre-trained on extensive raw code data without instruction fine-tuning. In this paper, we introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning by adapting the Evol-Instruct method to the domain of code. Through comprehensive experiments on four prominent code generation benchmarks, namely HumanEval, HumanEval+, MBPP, and DS-1000, we unveil the exceptional capabilities of our model. It surpasses all other open-source Code LLMs by a substantial margin. Moreover, our model even outperforms the largest closed LLMs, Anthropic's Claude and Google's Bard, on HumanEval and HumanEval+. Our code, model weights, and data are public at https://github.com/nlpxucan/WizardLM.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 65.06 | 1896 |
| Code Generation | HumanEval | Pass@1 | 73.2 | 1043 |
| Multi-task Language Understanding | MMLU | Accuracy | 32.29 | 881 |
| Code Generation | HumanEval (test) | Pass@1 | 73.8 | 612 |
| Commonsense Reasoning | WinoGrande | Accuracy | 61.72 | 453 |
| Code Generation | MBPP (test) | Pass@1 | 73.2 | 405 |
| Code Generation | HumanEval+ | Pass@1 | 56.7 | 393 |
| Mathematical Reasoning | AIME 2025 | Accuracy | 44.2 | 311 |
| Code Generation | MBPP+ | Pass@1 | 51.9 | 238 |
| Question Answering | ARC | Accuracy | 41.81 | 230 |
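
The code generation rows above report Pass@1. As a rough illustration of how such a score is commonly computed, the sketch below implements the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated for a problem and c is the number that pass its unit tests. This page does not state which evaluation protocol produced the listed numbers, so the estimator choice and the helper names `pass_at_k` and `mean_pass_at_k` are assumptions made for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated, c: samples that passed the tests, k: budget.
    Hypothetical helper; not taken from the WizardCoder code base.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # With fewer than k failing samples, every size-k draw contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset holds only failing samples,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)
```

For k = 1 the estimator reduces to c / n per problem, so a benchmark-level Pass@1 is simply the mean fraction of passing samples across problems, usually reported as a percentage.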