Many-Shot In-Context Learning
About
Large language models (LLMs) excel at few-shot in-context learning (ICL) -- learning from a few examples provided in context at inference, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples -- the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. While promising, many-shot ICL can be bottlenecked by the number of available human-generated examples. To mitigate this limitation, we explore two new settings: Reinforced and Unsupervised ICL. Reinforced ICL uses model-generated chain-of-thought rationales in place of human examples. Unsupervised ICL removes rationales from the prompt altogether, prompting the model only with domain-specific questions. We find that both Reinforced and Unsupervised ICL can be quite effective in the many-shot regime, particularly on complex reasoning tasks. Finally, we demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases, can learn high-dimensional functions with numerical inputs, and performs comparably to fine-tuning. We also find that inference cost increases linearly in the many-shot regime, and that frontier LLMs benefit from many-shot ICL to varying degrees. Our analysis also reveals the limitations of next-token prediction loss as an indicator of downstream ICL performance.
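The Reinforced ICL setting described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a generic `model(prompt) -> str` callable (a hypothetical stand-in for an LLM API) and a simple `Answer: ...` convention for extracting final answers. Model-generated chain-of-thought rationales are kept only when their final answer matches the known ground truth, then packed into a many-shot prompt.

```python
import re


def extract_answer(rationale: str) -> str:
    """Assumes rationales end with a line like 'Answer: <value>'."""
    for line in reversed(rationale.strip().splitlines()):
        if line.startswith("Answer:"):
            return line[len("Answer:"):].strip()
    return ""


def build_reinforced_prompt(model, problems, answers, new_question, shots=2):
    """Collect verified model-generated rationales and build a many-shot prompt."""
    examples = []
    for question, gold in zip(problems, answers):
        rationale = model(
            f"Q: {question}\nThink step by step, end with 'Answer: ...'"
        )
        # Reinforced ICL: keep a rationale only if its final answer is correct.
        if extract_answer(rationale) == gold:
            examples.append(f"Q: {question}\n{rationale}")
        if len(examples) == shots:
            break
    return "\n\n".join(examples + [f"Q: {new_question}"])


# Toy stand-in "model" that solves small additions, for illustration only.
def toy_model(prompt: str) -> str:
    m = re.search(r"Q: (\d+)\+(\d+)", prompt)
    a, b = int(m.group(1)), int(m.group(2))
    return f"We add {a} and {b}.\nAnswer: {a + b}"


prompt = build_reinforced_prompt(
    toy_model, ["2+3", "4+4"], ["5", "8"], "7+5", shots=2
)
print(prompt)
```

In practice the verified rationales would number in the hundreds or thousands to reach the many-shot regime; the Unsupervised ICL variant would instead concatenate the questions alone, with no rationales at all.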
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 86.58 | 1460 |
| Commonsense Question Answering | CommonsenseQA | Accuracy | 87.47 | 81 |
| Sentiment Analysis | SST-5 | Accuracy | 43.71 | 47 |
| Natural Language Inference | MNLI | Accuracy | 61.83 | 36 |
| Semantic Parsing | SMCalFlow | Program Accuracy | 52.67 | 22 |
| Semantic Parsing | Break | Accuracy | 42.08 | 18 |
| Image Classification | Fungi | Accuracy | 1.4 | 18 |
| In-Context Learning | 9-dataset Average (SST-5, MNLI, CMSQA, HellaSwag, GeoQ, NL2Bash, Break, MTOP, SMCalFlow) (test) | Accuracy | 66.83 | 15 |
| Reasoning | Geoquery | Accuracy | 73.36 | 14 |
| Semantic Parsing | MTOP | Accuracy | 45.32 | 14 |