Instruction Induction: From Few Examples to Natural Language Task Descriptions
About
Large language models are able to perform a task by conditioning on a few input-output demonstrations, a paradigm known as in-context learning. We show that language models can explicitly infer an underlying task from a few demonstrations by prompting them to generate a natural language instruction that fits the examples. To explore this ability, we introduce the instruction induction challenge, compile a dataset consisting of 24 tasks, and define a novel evaluation metric based on executing the generated instruction. We discover that, to a large extent, the ability to generate instructions does indeed emerge when using a model that is both large enough and aligned to follow instructions; InstructGPT achieves 65.7% of human performance in our execution-based metric, while the original GPT-3 model reaches only 9.8% of human performance. This surprising result suggests that instruction induction might be a viable learning paradigm in and of itself, where instead of fitting a set of latent continuous parameters to the data, one searches for the best description in the natural language hypothesis space.
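The two ideas in the abstract, prompting for an instruction and scoring it by execution, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the prompt wording, the helper names `build_induction_prompt` and `execution_accuracy`, the exact-match scoring rule, and the `toy_execute` stand-in that replaces a real model call. It is not the authors' implementation.

```python
from typing import Callable, Sequence, Tuple

Pair = Tuple[str, str]  # (input, expected output)


def build_induction_prompt(demos: Sequence[Pair]) -> str:
    """Format demonstrations into a prompt asking for the underlying instruction.

    The wording is illustrative only, not the prompt used in the paper.
    """
    lines = ["Here are some input-output pairs:", ""]
    for x, y in demos:
        lines.append(f"Input: {x}")
        lines.append(f"Output: {y}")
        lines.append("")
    lines.append("The instruction was:")
    return "\n".join(lines)


def execution_accuracy(
    instruction: str,
    held_out: Sequence[Pair],
    execute: Callable[[str, str], str],
) -> float:
    """Score an instruction by running it on held-out inputs.

    `execute(instruction, x)` is a hypothetical callable that returns the
    output produced when following `instruction` on input `x`. Scoring here
    is case-insensitive exact match, which is an assumption of this sketch.
    """
    if not held_out:
        raise ValueError("held_out must contain at least one example")
    correct = sum(
        execute(instruction, x).strip().lower() == y.strip().lower()
        for x, y in held_out
    )
    return correct / len(held_out)


def toy_execute(instruction: str, x: str) -> str:
    """Toy stand-in for a model: understands only a 'first letter' instruction."""
    if "first letter" in instruction.lower():
        return x[0]
    return x


if __name__ == "__main__":
    demos = [("apple", "a"), ("river", "r")]
    print(build_induction_prompt(demos))

    held_out = [("cat", "c"), ("dog", "d")]
    good = execution_accuracy("Write the first letter of the word.", held_out, toy_execute)
    bad = execution_accuracy("Repeat the word.", held_out, toy_execute)
    print(good, bad)
```

In practice `toy_execute` would be replaced by a call to an instruction-following model, and the generated instruction would come from the model's completion of the induction prompt rather than being written by hand.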
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| General Knowledge Reasoning | MMLU CF | Accuracy | 71.4 | 55 |
| Long-context Reasoning | LongBench | Accuracy (LongBench) | 59 | 45 |
| Grade School Math Word Problems | GSM8K | Accuracy | 0.934 | 42 |
| Multi-hop Question Answering | MuSiQue | Accuracy | 36.7 | 24 |
| Advanced Mathematical Reasoning | OlympiadBench | Accuracy | 12.2 | 18 |
| General Reasoning | BBH | Accuracy | 80.8 | 18 |
| Instruction Induction | Instruction Induction | Avg Execution Score | 13.23 | 17 |
| Long-context Reasoning | LongBench | Relative Cost | 1.13 | 14 |
| Multi-hop Reasoning | HotpotQA | Relative Cost | 1.19 | 14 |
| Multi-hop Reasoning | MuSiQue | Relative Cost | 1.22 | 14 |