AlpaGasus: Training A Better Alpaca with Fewer Data
About
Large language models (LLMs) acquire instruction-following capability through instruction finetuning (IFT) on supervised instruction/response data. However, widely used IFT datasets (e.g., Alpaca's 52k examples) surprisingly contain many low-quality instances with incorrect or irrelevant responses, which are misleading and detrimental to IFT. In this paper, we propose a simple and effective data selection strategy that automatically identifies and filters out low-quality data using a strong LLM (e.g., ChatGPT). Applying this strategy, we introduce AlpaGasus, which is finetuned on only 9k high-quality examples filtered from the 52k Alpaca data. AlpaGasus significantly outperforms the original Alpaca, as evaluated by GPT-4 on multiple test sets and by a controlled human evaluation. Its 13B variant matches >90% of the performance of its teacher LLM (i.e., text-davinci-003, which generated the 52k data) on the test tasks. It also trains 5.7x faster, cutting the training time of the 7B variant from 80 minutes (for Alpaca) to 14 minutes. Moreover, our experiments demonstrate the efficacy of the method across diverse datasets, base models, and LLM filters. Overall, AlpaGasus exemplifies a novel data-centric IFT paradigm that can be applied to instruction-tuning data in general, leading to faster training and better instruction-following models. Our project page is available at: https://lichang-chen.github.io/AlpaGasus/
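The selection strategy above can be sketched in a few lines: an LLM rater assigns each instruction/response pair a quality score, and only pairs at or above a threshold are kept for finetuning. The sketch below is illustrative, not the paper's implementation; the `toy_rate` heuristic is a hypothetical stand-in for the actual ChatGPT grading prompt, and the 4.5 cutoff is the threshold reported for reducing Alpaca's 52k examples to roughly 9k.

```python
# Sketch of AlpaGasus-style data selection: score each example with a
# rater, then keep only high-scoring examples for instruction finetuning.

def filter_ift_data(examples, rate, threshold=4.5):
    """Keep examples whose quality score meets the threshold.

    examples  -- list of dicts with 'instruction' and 'response' keys
    rate      -- callable mapping an example to a numeric score (e.g., 0-5);
                 the paper prompts a strong LLM such as ChatGPT for this
    threshold -- minimum score to keep
    """
    return [ex for ex in examples if rate(ex) >= threshold]


# Toy rater (hypothetical, for demonstration only): penalize empty or
# trivially short responses instead of calling an LLM.
def toy_rate(example):
    words = example["response"].strip().split()
    if not words:
        return 0.0
    return 5.0 if len(words) >= 3 else 2.0


data = [
    {"instruction": "Name three primary colors.",
     "response": "Red, yellow, and blue."},
    {"instruction": "Summarize the article.",
     "response": ""},          # low quality: empty response
    {"instruction": "Translate 'hello' to French.",
     "response": "Bonjour"},   # low quality under the toy rater: too short
]

kept = filter_ift_data(data, toy_rate)
print(len(kept))  # 1
```

In practice the rater would send each pair to the filter LLM with a grading prompt and parse the returned score; the filtering logic itself stays this simple.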
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multimodal Evaluation | MME | -- | 557 |
| Mathematical Reasoning | MathVista | Score: 23.9 | 322 |
| Science Question Answering | ARC Challenge | Accuracy: 56.4 | 234 |
| Science Question Answering | ScienceQA | -- | 229 |
| Multimodal Understanding | SEED-Bench | -- | 203 |
| Multimodal Evaluation | MMBench | MMB Score: 34.71 | 118 |
| Question Answering | ARC Challenge | Normalized Accuracy: 49.91 | 48 |
| Hallucination and Visual Reasoning Evaluation | HallusionBench | -- | 37 |
| General Language Modeling | MMLU, ARC-Challenge, and CommonsenseQA Aggregate | Average Score: 64.19 | 24 |
| Language Understanding | MMLU | MMLU Score: 65.18 | 24 |