Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLMs
About
This paper studies Automated Instruction Revision (AIR), a rule-induction-based method for adapting large language models (LLMs) to downstream tasks using limited task-specific examples. We position AIR within the broader landscape of adaptation strategies, including prompt optimization, retrieval-based methods, and fine-tuning. We then compare these approaches across a diverse benchmark suite designed to stress different task requirements, such as knowledge injection, structured extraction, label remapping, and logical reasoning. We argue that adaptation performance is strongly task-dependent: no single method dominates across all settings. Across five benchmarks, AIR was strongest or near-best on label-remapping classification, while KNN retrieval performed best on closed-book QA, and fine-tuning dominated structured extraction and event-order reasoning. AIR is most promising when task behavior can be captured by compact, interpretable instruction rules, while retrieval and fine-tuning remain stronger on tasks dominated by source-specific knowledge or dataset-specific annotation regularities.
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Classification (8 classes) | Twitter customer-support requests | Accuracy | 95.31 | 8 |
| Event logical reasoning | BizFinBench v2 | Accuracy | 51.67 | 8 |
| PII detection | PUPA | F1 score | 59.32 | 8 |
| Closed-book question answering | Ever Young | LLM score | 42.08 | 8 |
| Information extraction | Campaign-finance filings | Mean per-field accuracy | 35.9 | 8 |