Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation

About

With contributions from the open-source community, a vast amount of instruction tuning (IT) data has emerged. Given the significant resources required to train and evaluate models, an efficient method for selecting high-quality IT data is valuable. However, existing methods for instruction data selection have limitations: they rely on fragile external APIs, are affected by biases in GPT models, or reduce the diversity of the selected instruction dataset. In this paper, we propose an industrial-friendly, expert-aligned and diversity-preserved instruction data selection method: Clustering and Ranking (CaR). CaR employs a two-step process: first, it ranks instruction pairs using a high-accuracy (84.25%) scoring model aligned with expert preferences; second, it preserves dataset diversity through clustering. In our experiment, CaR selected a mere 1.96% of Alpaca's IT data, yet the resulting AlpaCaR model surpassed Alpaca's performance by an average of 32.1% in GPT-4 evaluations. Moreover, we find that data selection is a consistent paradigm, holding both when the pre-trained model is more capable and when the model parameters are scaled up. Our approach employs compact models with 550M parameters and incurs just 11.2% of the monetary cost of current methods, enhancing its industrial deployability.
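The two-step process described above (rank by quality, then use clusters to keep diversity) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `select_car`, the parameters `n_top` and `n_per_cluster`, and the rule for combining the two steps (keep the best pairs overall, plus the best pairs inside every cluster) are assumptions made for illustration. The scores are assumed to come from some quality scoring model, and the cluster labels from any clustering of the instruction pairs; neither is computed here.

```python
from collections import defaultdict


def select_car(scores, cluster_ids, n_top, n_per_cluster):
    """Return sorted indices chosen by a rank-then-cluster selection.

    scores:        quality score per instruction pair (higher is better);
                   assumed to come from an external scoring model.
    cluster_ids:   cluster label per instruction pair; assumed to come
                   from any clustering step run beforehand.
    n_top:         hypothetical number of best pairs kept overall.
    n_per_cluster: hypothetical number of best pairs kept per cluster.
    """
    if len(scores) != len(cluster_ids):
        raise ValueError("scores and cluster_ids must have the same length")

    # Step 1 (ranking): order all pairs by score, best first,
    # and keep the n_top best overall.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:n_top])

    # Step 2 (diversity): group indices by cluster. Iterating in ranked
    # order means each group is already sorted best first.
    by_cluster = defaultdict(list)
    for i in ranked:
        by_cluster[cluster_ids[i]].append(i)

    # Keep the best n_per_cluster pairs of every cluster, so that no
    # cluster is dropped entirely from the selected subset.
    for members in by_cluster.values():
        selected.update(members[:n_per_cluster])

    return sorted(selected)


scores = [0.9, 0.8, 0.2, 0.7, 0.1, 0.3]
cluster_ids = [0, 0, 1, 0, 2, 2]
print(select_car(scores, cluster_ids, n_top=2, n_per_cluster=1))  # [0, 1, 2, 5]
```

In this toy input, ranking alone would keep only indices 0 and 1, both from cluster 0. The per-cluster step adds index 2 (cluster 1) and index 5 (cluster 2), which is the diversity-preserving effect the abstract refers to.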

Yuan Ge, Yilun Liu, Chi Hu, Weibin Meng, Shimin Tao, Xiaofeng Zhao, Hongxia Ma, Li Zhang, Boxing Chen, Hao Yang, Bei Li, Tong Xiao, Jingbo Zhu • 2024

Related benchmarks

Task	Dataset	Result
Instruction Following	MT-Bench	MT-Bench Score: 6.58	215
Faithfulness Hallucination	FollowRAG Faithfulness+	Faithfulness (NaturalQA): 45.5	60
Instruction Following	MT-bench v1.0 (test)	MT-Bench Score: 61.2	52
Instruction Following	FollowRAG Instruction	FollowRAG Instruction Score: 42.3	30
Instruction Following	FollowRAG Instruction v1 (test)	FollowRAG Instruction Score: 40.5	30
Factuality Hallucination	LongFact	Facts Score: 21.1	30
Factuality Hallucination Evaluation	LongFact (test)	Response Score: 100	30
Factuality Hallucination Evaluation	BioGEN (test)	FactScore: 47.9	30
Factuality Hallucination	BioGEN	FactScore: 45.7	30
Instruction Following	Tulu3 Evaluation Suite pool (test)	ARC: 91.86	25
