
Many-Shot In-Context Learning

About

Large language models (LLMs) excel at few-shot in-context learning (ICL): learning from a few examples provided in context at inference time, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples, the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. While promising, many-shot ICL can be bottlenecked by the number of available human-generated examples. To mitigate this limitation, we explore two new settings: Reinforced and Unsupervised ICL. Reinforced ICL uses model-generated chain-of-thought rationales in place of human examples. Unsupervised ICL removes rationales from the prompt altogether and prompts the model only with domain-specific questions. We find that both Reinforced and Unsupervised ICL can be quite effective in the many-shot regime, particularly on complex reasoning tasks. Finally, we demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases, can learn high-dimensional functions with numerical inputs, and performs comparably to fine-tuning. We also find that inference cost increases linearly in the many-shot regime, and that frontier LLMs benefit from many-shot ICL to varying degrees. Our analysis also reveals the limitations of next-token prediction loss as an indicator of downstream ICL performance.

Rishabh Agarwal, Avi Singh, Lei M. Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade Nova, John D. Co-Reyes, Eric Chu, Feryal Behbahani, Aleksandra Faust, Hugo Larochelle · 2024
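The three prompting regimes described in the abstract can be sketched in a few lines of Python. This is a minimal illustration only, assuming a simple Q/A prompt format and a hypothetical `sample_rationale` callable that queries a model; none of these names or formats come from the paper itself.

```python
def many_shot_prompt(examples, question):
    """Standard many-shot ICL: concatenate many (question, rationale, answer)
    demonstrations, then append the new question for the model to answer."""
    parts = [
        f"Q: {ex['question']}\nA: {ex['rationale']} The answer is {ex['answer']}."
        for ex in examples
    ]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def unsupervised_prompt(domain_questions, question):
    """Unsupervised ICL: no rationales or answers in the prompt,
    only domain-specific questions."""
    parts = [f"Q: {q}" for q in domain_questions]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def reinforced_examples(problems, sample_rationale, n_samples=4):
    """Reinforced ICL: sample model-generated chain-of-thought rationales
    (via the hypothetical `sample_rationale` callable) and keep only those
    whose final answer matches the ground truth."""
    kept = []
    for p in problems:
        for _ in range(n_samples):
            rationale, answer = sample_rationale(p["question"])
            if answer == p["answer"]:
                kept.append({"question": p["question"],
                             "rationale": rationale,
                             "answer": answer})
                break  # one verified rationale per problem in this sketch
    return kept
```

The examples kept by `reinforced_examples` can be fed straight into `many_shot_prompt`, replacing human-written demonstrations with verified model-generated ones.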

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 86.58 | 1891 |
| Sentiment Analysis | SST-5 | Accuracy | 43.71 | 106 |
| Commonsense Question Answering | CommonsenseQA | Accuracy | 87.47 | 83 |
| Natural Language Inference | MNLI | Accuracy | 61.83 | 36 |
| Image Classification | Fungi | Accuracy | 1.4 | 25 |
| Semantic Parsing | SMCalFlow | Program Accuracy | 52.67 | 22 |
| Financial Analysis | Financial Analysis Benchmark | FiNER Accuracy | 72.3 | 22 |
| Semantic Parsing | Break | Accuracy | 42.08 | 18 |
| In-Context Learning | 9-dataset Average (SST-5, MNLI, CMSQA, HellaSwag, GeoQ, NL2Bash, Break, MTOP, SMCalFlow) (test) | Accuracy | 66.83 | 15 |
| Reasoning | Geoquery | Accuracy | 73.36 | 14 |
Showing 10 of 25 rows
