Active Example Selection for In-Context Learning
About
With a handful of demonstration examples, large-scale language models show strong capability to perform various tasks by in-context learning from these examples, without any fine-tuning. We demonstrate that in-context learning performance can be highly unstable across samples of examples, indicating the idiosyncrasies of how language models acquire information. We formulate example selection for in-context learning as a sequential decision problem, and propose a reinforcement learning algorithm for identifying generalizable policies to select demonstration examples. For GPT-2, our learned policies demonstrate a strong ability to generalize to tasks unseen during training, with a 5.8% improvement on average. Examples selected by our learned policies can even achieve a small improvement on GPT-3 Ada. However, the improvement diminishes on larger GPT-3 models, suggesting emergent capabilities of large language models.
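The sequential-decision framing described above can be sketched with a toy tabular Q-learning loop: the state is the set of demonstrations chosen so far, an action picks the next example, and a terminal reward scores the resulting prompt. This is a minimal illustration, not the paper's exact algorithm; the pool, the hard-coded "helpful" examples, and the stand-in `reward` function (which in practice would be the language model's validation accuracy on the assembled prompt) are all hypothetical.

```python
import random
from collections import defaultdict

POOL = list(range(10))  # hypothetical candidate demonstration examples
K = 3                   # number of demonstrations to select per prompt

def reward(selected):
    # Toy stand-in reward: pretend examples 1, 4, 7 are the most helpful.
    # In the paper's setting this would be LM accuracy with the prompt.
    return sum(1.0 for i in selected if i in (1, 4, 7)) / K

def train_policy(episodes=2000, eps=0.1, alpha=0.1, seed=0):
    """Tabular Q-learning over (step, example) pairs: a simplified
    sketch of learning an example-selection policy."""
    rng = random.Random(seed)
    q = defaultdict(float)  # Q[(step, example index)]
    for _ in range(episodes):
        selected = []
        for step in range(K):
            avail = [i for i in POOL if i not in selected]
            if rng.random() < eps:                          # explore
                a = rng.choice(avail)
            else:                                           # exploit
                a = max(avail, key=lambda i: q[(step, i)])
            selected.append(a)
        r = reward(selected)  # terminal reward only
        for step, a in enumerate(selected):
            q[(step, a)] += alpha * (r - q[(step, a)])
    return q

def select(q):
    """Greedy rollout of the learned policy."""
    selected = []
    for step in range(K):
        avail = [i for i in POOL if i not in selected]
        selected.append(max(avail, key=lambda i: q[(step, i)]))
    return selected

if __name__ == "__main__":
    q = train_policy()
    print(select(q))
```

The key design choice mirrored from the abstract is that selection is sequential: each pick conditions on the examples already chosen, rather than scoring examples independently.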
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 86.83 | 1460 |
| Natural Language Inference | RTE | Accuracy | 47.5 | 367 |
| Natural Language Inference | SNLI | Accuracy | 35 | 174 |
| Intent Classification | Banking77 (test) | Accuracy | 84.2 | 151 |
| Commonsense Question Answering | CommonsenseQA | Accuracy | 87.55 | 81 |
| Sentiment Analysis | SST-5 | Accuracy | 43.69 | 47 |
| Natural Language Inference | QNLI | Accuracy | 61.5 | 42 |
| Natural Language Inference | MNLI | Accuracy | 70.92 | 36 |
| Natural Language Inference | MNLI-mm | Accuracy | 29.5 | 30 |
| Paraphrase Detection | PAWS | Accuracy | 51.7 | 24 |