Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning

About

There is a consensus that instruction fine-tuning of LLMs requires high-quality data, but what are they? LIMA (NeurIPS 2023) and AlpaGasus (ICLR 2024) are state-of-the-art methods for selecting such high-quality examples, either via manual curation or using GPT-3.5-Turbo as a quality scorer. We show that the extremely simple baseline of selecting the 1,000 instructions with longest responses -- that intuitively contain more learnable information and are harder to overfit -- from standard datasets can consistently outperform these sophisticated methods according to GPT-4 and PaLM-2 as judges, while remaining competitive on the Open LLM benchmarks that test factual knowledge. We demonstrate this for several LLMs (Llama-2-7B, Llama-2-13B, Mistral-7B-v0.1) and datasets (Alpaca-52k, Evol-Instruct-70k). In addition, a lightweight refinement of such long instructions can further improve the abilities of the fine-tuned LLMs, and allows us to obtain competitive results on MT-Bench and the 2nd highest-ranked Llama-2-7B-based model on AlpacaEval 2.0, while training on only 1,000 examples and no extra preference data. We also conduct a thorough analysis of our models to ensure that their enhanced performance is not simply due to GPT-4's preference for longer responses. Overall, our findings suggest that fine-tuning on the longest responses should be the default baseline for any work on instruction fine-tuning. We provide our code at https://github.com/tml-epfl/long-is-more-for-alignment.
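The selection baseline described above (keep the 1,000 examples with the longest responses) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation, which is available at the repository linked above. The function name `select_longest`, the Alpaca-style `output` field, and the whitespace word count used as the length measure are all assumptions made for this example; a tokenizer-based length could be passed in through `length_fn` instead.

```python
from typing import Callable, Dict, List, Optional


def select_longest(
    examples: List[Dict[str, str]],
    k: int = 1000,
    response_key: str = "output",  # assumed field name (Alpaca-style records)
    length_fn: Optional[Callable[[str], int]] = None,
) -> List[Dict[str, str]]:
    """Return the k examples whose responses are longest.

    Length defaults to a whitespace word count, which is only a proxy;
    pass a tokenizer-based length_fn to measure length in tokens.
    """
    if length_fn is None:
        length_fn = lambda text: len(text.split())
    # sorted() is stable, so ties keep their original dataset order.
    ranked = sorted(
        examples,
        key=lambda ex: length_fn(ex[response_key]),
        reverse=True,
    )
    return ranked[:k]


# Toy data standing in for an instruction dataset (hypothetical records).
toy = [
    {"instruction": "a", "output": "one two three"},
    {"instruction": "b", "output": "one"},
    {"instruction": "c", "output": "one two three four five"},
]
top = select_longest(toy, k=2)
print([ex["instruction"] for ex in top])  # ['c', 'a']
```

With a real dataset, `examples` would be the loaded instruction records and `k` would stay at its default of 1,000, matching the subset size stated in the abstract.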

Hao Zhao, Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion • 2024

Related benchmarks

Task	Dataset	Result
Language Understanding	MMLU	Accuracy 62.8	844
Reasoning	BBH	Accuracy 45.3	726
Instruction Following	AlpacaEval 2.0	Win Rate 7.13	722
Instruction Following	MT-Bench	--	287
Logical reasoning	BBH	Accuracy 74.53	249
Multi-turn Instruction Following	MT-Bench	MT-Bench Score (GPT-4) 4.45	129
General Reasoning	BIG-Bench Hard	--	68
Multilingual Question Answering	TyDiQA	Accuracy 63.9	65
Mathematical Reasoning	GSM8K (test)	EM Accuracy 78.7	41
General Language Understanding and Reasoning	HuggingFace Open LLM Leaderboard	HellaSwag Accuracy 60.58	30

Showing 10 of 19 rows
