What Makes Good Instruction-Tuning Data? An In-Context Learning Perspective

About

Instruction-tuning datasets often contain substantial redundancy and low-quality samples, necessitating effective data selection methods. We propose an instruction data selection framework based on weighted in-context influence (wICI), which measures how effectively each candidate example reduces instruction-following difficulty for semantically related peers. Through systematic experiments, we address three key questions: what constitutes effective instruction-tuning data from an in-context perspective, whether sample difficulty correlates with in-context influence, and how in-context influence translates to instruction-tuning effectiveness. Experiments across multiple models and benchmarks demonstrate that our method consistently outperforms existing baselines under constrained data budgets, and show empirically that sample difficulty negatively correlates with in-context influence.

Guangzeng Han, Xiaolei Huang • 2026
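
To make the idea of weighted in-context influence more concrete, here is a minimal sketch of one way such a score could be computed. It is not the authors' implementation: the abstract does not give the formula, so the weighted-average form, the function names `weighted_in_context_influence` and `select_top_k`, and all numbers are assumptions made for illustration. It assumes that "difficulty" is a per-sample loss, that the loss on each peer has already been measured with and without the candidate as an in-context demonstration, and that a non-negative semantic similarity between samples is available.

```python
from typing import List


def weighted_in_context_influence(
    zero_shot_loss: List[float],
    in_context_loss: List[List[float]],
    similarity: List[List[float]],
) -> List[float]:
    """Score each candidate by the similarity-weighted loss reduction it gives its peers.

    All inputs are hypothetical, precomputed quantities:
      zero_shot_loss[j]     : loss on peer j with no demonstration.
      in_context_loss[i][j] : loss on peer j when candidate i is the demonstration.
      similarity[i][j]      : non-negative semantic similarity between i and j.
    """
    n = len(zero_shot_loss)
    scores = []
    for i in range(n):
        weighted_gain = 0.0
        total_weight = 0.0
        for j in range(n):
            if i == j:
                continue  # a candidate is not counted as its own peer
            # Positive gain means peer j became easier with candidate i in context.
            gain = zero_shot_loss[j] - in_context_loss[i][j]
            weighted_gain += similarity[i][j] * gain
            total_weight += similarity[i][j]
        scores.append(weighted_gain / total_weight if total_weight > 0 else 0.0)
    return scores


def select_top_k(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring candidates (the data budget)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy values for three samples; these are made up, not results from the paper.
zero_shot_loss = [2.0, 1.5, 1.0]
in_context_loss = [
    [0.0, 1.0, 0.9],  # losses on peers when sample 0 is the demonstration
    [1.8, 0.0, 1.0],  # ... when sample 1 is the demonstration
    [2.0, 1.5, 0.0],  # ... when sample 2 is the demonstration
]
similarity = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.5],
    [0.2, 0.5, 1.0],
]

scores = weighted_in_context_influence(zero_shot_loss, in_context_loss, similarity)
print(select_top_k(scores, k=2))  # sample 0 helps its peers most, then sample 1
```

In this toy setup, sample 2 leaves its peers' losses unchanged and so receives a score of zero, which is the kind of example a budget-constrained selection would drop first.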

Related benchmarks

Task	Dataset	Metric	Result
Instruction Following	AlpacaEval 2.0	Win Rate	7.5	752
Question Answering	ARC Challenge	Accuracy (ARC)	58.98	631
General Knowledge Evaluation	MMLU	MMLU Accuracy	64.9	167
Instruction Following	IFEval (test)	IFEval Score	51.19	92
Question Answering	MedQA (test)	Accuracy	46.03	67
Question Answering	MedMCQA (test)	--	--	48
Multi-turn Chat Evaluation	MT-Bench	MT-Bench Score	5.28	42
Question Answering	MMLU Med	Accuracy	65.33	34
Instruction Following	WizardLM (test)	Score	1.308	25
Instruction Following	AlpacaEval GPT-4 (test)	AlpacaEval Win Rate (GPT-4)	1.261	18
