Instruction Induction: From Few Examples to Natural Language Task Descriptions

About

Large language models are able to perform a task by conditioning on a few input-output demonstrations, a paradigm known as in-context learning. We show that language models can explicitly infer an underlying task from a few demonstrations by prompting them to generate a natural language instruction that fits the examples. To explore this ability, we introduce the instruction induction challenge, compile a dataset consisting of 24 tasks, and define a novel evaluation metric based on executing the generated instruction. We discover that, to a large extent, the ability to generate instructions does indeed emerge when using a model that is both large enough and aligned to follow instructions; InstructGPT achieves 65.7% of human performance in our execution-based metric, while the original GPT-3 model reaches only 9.8% of human performance. This surprising result suggests that instruction induction might be a viable learning paradigm in and of itself, where instead of fitting a set of latent continuous parameters to the data, one searches for the best description in the natural language hypothesis space.

Or Honovich, Uri Shaham, Samuel R. Bowman, Omer Levy • 2022
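
The two ideas in the abstract, prompting a model with demonstrations to elicit an instruction and scoring that instruction by executing it, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the prompt wording, the function names `build_induction_prompt` and `execution_score`, the exact-match scoring rule, and the `toy_execute` stand-in (which replaces a real language model) are hypothetical and are not taken from the paper's released code or its exact templates and per-task metrics.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (input, expected output)


def build_induction_prompt(demos: Sequence[Example]) -> str:
    """Format demonstrations into a prompt that asks for the instruction.

    The wording is an illustrative assumption, not the paper's template.
    """
    lines: List[str] = [
        "Here are input-output pairs produced by following one instruction.",
        "",
    ]
    for x, y in demos:
        lines.append(f"Input: {x}")
        lines.append(f"Output: {y}")
        lines.append("")
    lines.append("The instruction was:")
    return "\n".join(lines)


def execution_score(
    instruction: str,
    held_out: Sequence[Example],
    execute: Callable[[str, str], str],
) -> float:
    """Fraction of held-out examples solved when `execute` follows `instruction`.

    Exact match after stripping whitespace is an assumed scoring rule.
    """
    if not held_out:
        raise ValueError("held_out must contain at least one example")
    correct = sum(
        execute(instruction, x).strip() == y.strip() for x, y in held_out
    )
    return correct / len(held_out)


def toy_execute(instruction: str, x: str) -> str:
    """Hypothetical stand-in for a model that executes an instruction."""
    if "reverse" in instruction.lower():
        return x[::-1]
    return x  # otherwise just copy the input


HELD_OUT: List[Example] = [("stone", "enots"), ("level", "level")]

if __name__ == "__main__":
    print(build_induction_prompt([("abc", "cba"), ("hello", "olleh")]))
    print(execution_score("Reverse the input.", HELD_OUT, toy_execute))
    print(execution_score("Copy the input.", HELD_OUT, toy_execute))
```

In this toy setup a correct induced instruction solves every held-out example, while a wrong one is credited only for the palindrome it happens to get right. In a real evaluation, `execute` would call an instruction-following model rather than a hand-written rule.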

Related benchmarks

Task	Dataset	Metric	Result
General Reasoning	BBH	Accuracy	80.8	190
Long-context Reasoning	LongBench	Accuracy (LongBench)	59	101
Grade School Math Word Problems	GSM8K	Accuracy	0.934	66
General Knowledge Reasoning	MMLU CF	Accuracy	71.4	64
Multi-hop Question Answering	MuSiQue	Accuracy	36.7	24
Advanced Mathematical Reasoning	OlympiadBench	Accuracy	12.2	18
Instruction Induction	Instruction Induction	Avg Execution Score	13.23	17
Long-context Reasoning	LongBench	Relative Cost	1.13	14
Multi-hop Reasoning	HotpotQA	Relative Cost	1.19	14
Multi-hop Reasoning	MuSiQue	Relative Cost	1.22	14

