Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning
About
There is a consensus that instruction fine-tuning of LLMs requires high-quality data, but what makes data high-quality? LIMA (NeurIPS 2023) and AlpaGasus (ICLR 2024) are state-of-the-art methods for selecting such high-quality examples, either via manual curation or using GPT-3.5-Turbo as a quality scorer. We show that the extremely simple baseline of selecting the 1,000 instructions with the longest responses -- which intuitively contain more learnable information and are harder to overfit -- from standard datasets can consistently outperform these sophisticated methods according to GPT-4 and PaLM-2 as judges, while remaining competitive on the Open LLM benchmarks that test factual knowledge. We demonstrate this for several LLMs (Llama-2-7B, Llama-2-13B, Mistral-7B-v0.1) and datasets (Alpaca-52k, Evol-Instruct-70k). In addition, a lightweight refinement of such long instructions can further improve the abilities of the fine-tuned LLMs, and allows us to obtain competitive results on MT-Bench and the 2nd highest-ranked Llama-2-7B-based model on AlpacaEval 2.0, while training on only 1,000 examples and no extra preference data. We also conduct a thorough analysis of our models to ensure that their enhanced performance is not simply due to GPT-4's preference for longer responses. Overall, our findings suggest that fine-tuning on the longest responses should be the default baseline for any work on instruction fine-tuning. We provide our code at https://github.com/tml-epfl/long-is-more-for-alignment.
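The selection baseline described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation (see the linked repository for that). It assumes each example is a dict with the response stored under an `output` key, as in Alpaca-style records, and it measures length by whitespace-separated word count; both the field name and the length measure are assumptions, and the helper name `select_longest` is hypothetical.

```python
import heapq


def select_longest(examples, k=1000, response_key="output", length_fn=None):
    """Return the k examples with the longest responses, longest first.

    Assumptions (not taken from the paper): examples are dicts, the response
    lives under `response_key`, and length defaults to a whitespace word count.
    Pass a tokenizer-based `length_fn` to count tokens instead.
    """
    if length_fn is None:
        length_fn = lambda text: len(text.split())
    # heapq.nlargest avoids sorting the full dataset when k is small.
    return heapq.nlargest(
        k, examples, key=lambda ex: length_fn(ex[response_key])
    )


if __name__ == "__main__":
    toy = [
        {"instruction": "a", "output": "short"},
        {"instruction": "b", "output": "a much longer response here"},
        {"instruction": "c", "output": "medium length"},
    ]
    for ex in select_longest(toy, k=2):
        print(ex["instruction"], len(ex["output"].split()))
```

The resulting subset would then be used as the fine-tuning set in place of the full dataset; swapping in a tokenizer-based `length_fn` changes only how length is measured, not the selection logic.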
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Language Understanding | MMLU | Accuracy | 62.8 | 756 |
| Reasoning | BBH | Accuracy | 45.3 | 507 |
| Logical reasoning | BBH | Accuracy | 74.53 | 93 |
| General Reasoning | BIG-Bench Hard | -- | -- | 68 |
| Multilingual Question Answering | TyDiQA | Accuracy | 63.9 | 44 |
| Code Generation | MBPP | MBPP Accuracy | 78.08 | 22 |
| Mathematical Reasoning | GSM8K | GSM Score | 89.02 | 7 |
| Mathematical Reasoning | gsm | GSM Accuracy | 85.05 | 7 |
| Multitask Language Understanding | MMLU | MMLU Score | 74.77 | 7 |