Many-Shot In-Context Learning
About
Large language models (LLMs) excel at few-shot in-context learning (ICL) -- learning from a few examples provided in context at inference, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples -- the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. While promising, many-shot ICL can be bottlenecked by the number of available human-generated examples. To mitigate this limitation, we explore two new settings: Reinforced and Unsupervised ICL. Reinforced ICL uses model-generated chain-of-thought rationales in place of human examples. Unsupervised ICL removes rationales from the prompt altogether, prompting the model only with domain-specific questions. We find that both Reinforced and Unsupervised ICL can be quite effective in the many-shot regime, particularly on complex reasoning tasks. Finally, we demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases, can learn high-dimensional functions with numerical inputs, and performs comparably to fine-tuning. We also find that inference cost increases linearly in the many-shot regime, and that frontier LLMs benefit from many-shot ICL to varying degrees. Our analysis also reveals the limitations of next-token prediction loss as an indicator of downstream ICL performance.
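The Reinforced ICL setting described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a generic `model(prompt) -> str` callable (a hypothetical stand-in for an LLM API) and a simple `Answer: ...` convention for extracting final answers. Model-generated chain-of-thought rationales are kept only when their final answer matches the known ground truth, then packed into a many-shot prompt.

```python
import re


def extract_answer(rationale: str) -> str:
    """Assumes rationales end with a line like 'Answer: <value>'."""
    for line in reversed(rationale.strip().splitlines()):
        if line.startswith("Answer:"):
            return line[len("Answer:"):].strip()
    return ""


def build_reinforced_prompt(model, problems, answers, new_question, shots=2):
    """Collect verified model-generated rationales and build a many-shot prompt."""
    examples = []
    for question, gold in zip(problems, answers):
        rationale = model(
            f"Q: {question}\nThink step by step, end with 'Answer: ...'"
        )
        # Reinforced ICL: keep a rationale only if its final answer is correct.
        if extract_answer(rationale) == gold:
            examples.append(f"Q: {question}\n{rationale}")
        if len(examples) == shots:
            break
    return "\n\n".join(examples + [f"Q: {new_question}"])


# Toy stand-in "model" that solves small additions, for illustration only.
def toy_model(prompt: str) -> str:
    m = re.search(r"Q: (\d+)\+(\d+)", prompt)
    a, b = int(m.group(1)), int(m.group(2))
    return f"We add {a} and {b}.\nAnswer: {a + b}"


prompt = build_reinforced_prompt(
    toy_model, ["2+3", "4+4"], ["5", "8"], "7+5", shots=2
)
print(prompt)
```

In practice the verified rationales would number in the hundreds or thousands to reach the many-shot regime; the Unsupervised ICL variant would instead concatenate the questions alone, with no rationales at all.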
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 86.58 | 1460 |
| Commonsense Question Answering | CommonsenseQA | Accuracy | 87.47 | 81 |
| Sentiment Analysis | SST-5 | Accuracy | 43.71 | 47 |
| Natural Language Inference | MNLI | Accuracy | 61.83 | 36 |
| Semantic Parsing | SMCalFlow | Program Accuracy | 52.67 | 22 |
| Semantic Parsing | Break | Accuracy | 42.08 | 18 |
| Image Classification | Fungi | Accuracy | 1.4 | 18 |
| In-Context Learning | 9-dataset Average (SST-5, MNLI, CMSQA, HellaSwag, GeoQ, NL2Bash, Break, MTOP, SMCalFlow) (test) | Accuracy | 66.83 | 15 |
| Reasoning | Geoquery | Accuracy | 73.36 | 14 |
| Semantic Parsing | MTOP | Accuracy | 45.32 | 14 |