
Many-Shot In-Context Learning

About

Large language models (LLMs) excel at few-shot in-context learning (ICL) -- learning from a few examples provided in context at inference, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples -- the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. While promising, many-shot ICL can be bottlenecked by the available amount of human-generated examples. To mitigate this limitation, we explore two new settings: Reinforced and Unsupervised ICL. Reinforced ICL uses model-generated chain-of-thought rationales in place of human examples. Unsupervised ICL removes rationales from the prompt altogether, and prompts the model only with domain-specific questions. We find that both Reinforced and Unsupervised ICL can be quite effective in the many-shot regime, particularly on complex reasoning tasks. Finally, we demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases, can learn high-dimensional functions with numerical inputs, and performs comparably to fine-tuning. We also find that inference cost increases linearly in the many-shot regime, and frontier LLMs benefit from many-shot ICL to varying degrees. Our analysis also reveals the limitations of next-token prediction loss as an indicator of downstream ICL performance.
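The core mechanism described above is simple prompt construction: concatenate many solved examples before the query. The sketch below illustrates this under stated assumptions — the `Q:`/`A:` format and the `build_many_shot_prompt` helper are hypothetical, not the paper's exact prompt template. Reinforced ICL would fill each answer with a model-generated chain-of-thought rationale; Unsupervised ICL corresponds to `include_answers=False`, prompting with domain-specific questions only.

```python
def build_many_shot_prompt(examples, question, include_answers=True):
    """Concatenate in-context examples (possibly hundreds) before the query.

    examples: list of {"question": str, "answer": str} dicts.
    include_answers=False drops the answers, leaving only the questions,
    as in the Unsupervised ICL setting.
    """
    parts = []
    for ex in examples:
        parts.append(f"Q: {ex['question']}")
        if include_answers:
            parts.append(f"A: {ex['answer']}")
    # The query is appended last, with an open "A:" for the model to complete.
    parts.append(f"Q: {question}")
    parts.append("A:")
    return "\n".join(parts)


# Toy usage; real many-shot prompts would contain hundreds of examples.
examples = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "3 * 5 = ?", "answer": "15"},
]
prompt = build_many_shot_prompt(examples, "7 - 4 = ?")
```

Because the prompt grows linearly with the number of shots, token count (and thus inference cost) scales linearly in the many-shot regime, as the abstract notes.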

Rishabh Agarwal, Avi Singh, Lei M. Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade Nova, John D. Co-Reyes, Eric Chu, Feryal Behbahani, Aleksandra Faust, Hugo Larochelle • 2024

Related benchmarks

Task                           | Dataset                  | Metric           | Result | Rank
Commonsense Reasoning          | HellaSwag                | Accuracy         | 86.58  | 1460
Commonsense Question Answering | CommonsenseQA            | Accuracy         | 87.47  | 81
Sentiment Analysis             | SST-5                    | Accuracy         | 43.71  | 47
Natural Language Inference     | MNLI                     | Accuracy         | 61.83  | 36
Semantic Parsing               | SMCalFlow                | Program Accuracy | 52.67  | 22
Semantic Parsing               | Break                    | Accuracy         | 42.08  | 18
Image Classification           | Fungi                    | Accuracy         | 1.4    | 18
In-Context Learning            | 9-dataset Average (SST-5, MNLI, CMSQA, HellaSwag, GeoQ, NL2Bash, Break, MTOP, SMCalFlow) (test) | Accuracy | 66.83 | 15
Reasoning                      | Geoquery                 | Accuracy         | 73.36  | 14
Semantic Parsing               | MTOP                     | Accuracy         | 45.32  | 14

(10 of 15 rows shown)
