Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLMs
About
This paper studies Automated Instruction Revision (AIR), a rule-induction-based method for adapting large language models (LLMs) to downstream tasks using limited task-specific examples. We position AIR within the broader landscape of adaptation strategies, including prompt optimization, retrieval-based methods, and fine-tuning. We then compare these approaches across a diverse benchmark suite designed to stress different task requirements, such as knowledge injection, structured extraction, label remapping, and logical reasoning. We argue that adaptation performance is strongly task-dependent: no single method dominates across all settings. Across five benchmarks, AIR was strongest or near-best on label-remapping classification, while KNN retrieval performed best on closed-book QA, and fine-tuning dominated structured extraction and event-order reasoning. AIR is most promising when task behavior can be captured by compact, interpretable instruction rules, while retrieval and fine-tuning remain stronger on tasks dominated by source-specific knowledge or dataset-specific annotation regularities.
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Classification (8 classes) | Twitter customer-support requests | Accuracy | 95.31 | 8 |
| Event logical reasoning | BizFinBench v2 | Accuracy | 51.67 | 8 |
| PII detection | PUPA | F1 score | 59.32 | 8 |
| Closed-book question answering | Ever Young | LLM score | 42.08 | 8 |
| Information extraction | Campaign-finance filings | Mean per-field accuracy | 35.9 | 8 |