
Task-Specificity Score: Measuring How Much Instructions Really Matter for Supervision

About

Instruction tuning is now the default way to train and adapt large language models, but many instruction-input-output pairs are only weakly specified: for a given input, the same output can remain plausible under several alternative instructions. This raises a simple question: does the instruction uniquely determine the target output? We propose the Task-Specificity Score (TSS) to quantify how much an instruction matters for predicting its output, by contrasting the true instruction against plausible alternatives for the same input. We further introduce TSS++, which uses hard alternatives and a small quality term to mitigate easy-negative effects. Across three instruction datasets (Alpaca, Dolly-15k, NI-20) and three open LLMs (Gemma, Llama, Qwen), we show that selecting task-specific examples improves downstream performance under tight token budgets and complements quality-based filters such as perplexity and IFD.
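The abstract describes TSS as contrasting the true instruction against plausible alternatives for the same input. The exact formula is not given in this listing; a minimal sketch of one natural instantiation, assuming each candidate instruction is scored by the output's log-likelihood under that instruction and the contrast is a softmax over candidates (a hypothetical form, not the paper's definition):

```python
import math

def task_specificity_score(lp_true, lp_alts):
    """Toy TSS sketch: the true instruction's share of softmax mass over
    candidate instructions, given each candidate's log-likelihood of the
    target output. Hypothetical form; the paper's exact definition is not
    stated in this abstract.
    lp_true: log P(output | input, true instruction)
    lp_alts: log P(output | input, alt instruction) for each alternative
    """
    scores = [lp_true] + list(lp_alts)
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return exps[0] / sum(exps)  # in (0, 1]; higher = more task-specific

# If alternatives explain the output as well as the true instruction,
# the score collapses toward 1 / (num candidates):
print(task_specificity_score(-1.0, [-1.0, -1.0]))  # ~0.333
# If the true instruction is much more predictive, the score approaches 1:
print(task_specificity_score(-1.0, [-5.0, -5.0]))
```

Under this reading, a near-uniform score flags a weakly specified example, which is what the budgeted-selection experiments below would filter out.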

Pritam Kadasi, Abhishek Upperwal, Mayank Singh • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Understanding | Aggregate: ARC-C, MMLU, HellaSwag, TruthfulQA (test) | Total Score | 155.2 | 22 |
| Budgeted subset selection | Alpaca (5% retention) | Sum | 157.2 | 6 |
| Budgeted subset selection | Dolly (5% retention) | Sum | 153.6 | 6 |
| Budgeted subset selection | Dolly (15% retention, train) | Sum | 154.1 | 6 |
| Budgeted subset selection | NI-20 (5% retention) | Sum | 155.2 | 5 |
| Budgeted subset selection | NI-20 (15% retention, train) | Total Sum | 154.8 | 5 |
| Budgeted subset selection | Alpaca (15% retention, train) | Total Sum | 144.6 | 5 |
