
On Representation Redundancy in Large-Scale Instruction Tuning Data Selection

About

Data quality is a crucial factor in training large language models. While prior work has shown that models trained on smaller, high-quality datasets can outperform those trained on much larger but noisy or low-quality corpora, systematic methods for industrial-scale data selection in instruction tuning remain underexplored. In this work, we study instruction-tuning data selection through the lens of semantic representation similarity and identify a key limitation of state-of-the-art LLM encoders: they produce highly redundant semantic embeddings. To mitigate this redundancy, we propose Compressed Representation Data Selection (CRDS), a novel framework with two variants. CRDS-R applies Rademacher random projection followed by concatenation of transformer hidden-layer representations, while CRDS-W employs whitening-based dimensionality reduction to improve representational quality. Experimental results demonstrate that both variants substantially enhance data quality and consistently outperform state-of-the-art representation-based selection methods. Notably, CRDS-W achieves strong performance using only 3.5% of the data, surpassing the full-data baseline by an average of 0.71% across four datasets. Our code is available at https://github.com/tdano1/CRDS.
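The two compression steps named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that); the array shapes, function names, and the use of plain PCA whitening are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for transformer hidden-layer embeddings:
# 100 samples, 768 dims, from 3 layers (hypothetical shapes).
layers = [rng.normal(size=(100, 768)) for _ in range(3)]

def rademacher_project(X, out_dim, rng):
    """Project X with a Rademacher (+1/-1) random matrix, scaled so
    squared norms are preserved in expectation (Johnson-Lindenstrauss style)."""
    R = rng.choice([-1.0, 1.0], size=(X.shape[1], out_dim))
    return X @ R / np.sqrt(out_dim)

# CRDS-R style (illustrative only): project each hidden layer,
# then concatenate the compressed representations.
compressed = np.concatenate(
    [rademacher_project(H, 64, rng) for H in layers], axis=1
)  # shape (100, 192)

def whiten_reduce(X, k, eps=1e-5):
    """PCA whitening with dimensionality reduction: keep the top-k
    principal directions and rescale them to unit variance, which
    decorrelates (de-redundifies) the embedding features."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(Xc) - 1)
    vals, vecs = np.linalg.eigh(cov)          # eigenvalues ascending
    top_vals, top_vecs = vals[-k:], vecs[:, -k:]
    return (Xc @ top_vecs) / np.sqrt(top_vals + eps)

# CRDS-W style (illustrative only): whitening-based reduction of one layer.
white = whiten_reduce(layers[-1], k=64)  # shape (100, 64)
print(compressed.shape, white.shape)
```

After either transform, a selection method would score or cluster the compressed embeddings instead of the raw, redundant ones; the whitened output has an (almost) identity sample covariance, which is the sense in which redundancy is removed.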

Youwei Shu, Shaomian Zheng, Dingnan Jin, Wenjie Qu, Ziyao Guo, Qing Cui, Jun Zhou, Jiaheng Zhang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Logical Reasoning | BBH | Accuracy: 83.86 | 93 |
| General Reasoning | BIG-Bench Hard | -- | 68 |
| Code Generation | MBPP | MBPP Accuracy: 85.2 | 22 |
| Mathematical Reasoning | GSM8K | GSM Score: 92.55 | 7 |
| Mathematical Reasoning | gsm | GSM Accuracy: 92.16 | 7 |
| Multitask Language Understanding | MMLU | MMLU Score: 77.99 | 7 |
