AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data
About
Open-source Large Language Models (LLMs) and their specialized variants, particularly Code LLMs, have recently delivered impressive performance. However, previous Code LLMs are typically fine-tuned on single-source data with limited quality and diversity, which may not fully elicit the potential of pre-trained Code LLMs. In this paper, we present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities, fine-tuned on multi-source data. To achieve this, we are the first to reveal inherent conflicts among the various styles and qualities in multi-source code corpora, and we introduce data-specific prompts with hindsight relabeling, termed AlchemistPrompts, to harmonize different data sources and instruction-response pairs. Additionally, we propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review. Extensive experiments demonstrate that AlchemistCoder holds a clear lead among all models of the same size (6.7B/7B) and rivals or even surpasses larger models (15B/33B/70B), showcasing the efficacy of our method in refining instruction-following capabilities and advancing the boundaries of code intelligence.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval | Pass@1 | 79.9 | 850 |
| Mathematical Reasoning | GSM8K (test) | Accuracy | 28 | 751 |
| Multitask Language Understanding | MMLU (test) | Accuracy | 43.9 | 303 |
| Code Generation | HumanEval+ | Pass@1 | 75.6 | 189 |
| Code Generation | MBPP+ | Pass@1 | 60.2 | 122 |
| Reasoning | BBH (test) | Accuracy | 45.6 | 40 |
| Code Generation | HumanEval-X | Pass@1 (C++) | 62.2 | 20 |
| Data Science Code Completion | DS-1000 | Pandas (Pass@1) | 32 | 9 |
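
Several rows above report Pass@1. This page does not say how those scores were computed, so the following is only a minimal sketch of the unbiased pass@k estimator commonly used for HumanEval-style benchmarks, where `n` samples are generated per problem and `c` of them pass the unit tests. The function names `pass_at_k` and `mean_pass_at_k` are illustrative and not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: budget being scored.
    Computes 1 - C(n - c, k) / C(n, k), the probability that at least one
    of k samples drawn without replacement from the n is correct.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical per-problem counts, for illustration only.
    demo = [(10, 3), (10, 7)]
    print(f"pass@1 = {mean_pass_at_k(demo, k=1):.3f}")
```

For `k = 1` the estimate reduces to `c / n` per problem, so with a single greedy sample per problem Pass@1 is simply the fraction of problems solved.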