More is not always better? Enhancing Many-Shot In-Context Learning with Differentiated and Reweighting Objectives
About
Large language models (LLMs) excel at few-shot in-context learning (ICL) without requiring parameter updates. However, as ICL demonstrations increase from a few to many, performance tends to plateau and eventually decline. We identify two primary causes for this trend: the suboptimal negative log-likelihood (NLL) optimization objective and the incremental data noise. To address these issues, we introduce DrICL, a novel optimization method that enhances model performance through Differentiated and Reweighting objectives. Globally, DrICL utilizes differentiated learning to optimize the NLL objective, ensuring that many-shot performance surpasses zero-shot levels. Locally, it dynamically adjusts the weighting of many-shot demonstrations by leveraging cumulative advantages inspired by reinforcement learning, thereby mitigating the impact of noisy data. Recognizing the lack of multi-task datasets with diverse many-shot distributions, we develop the Many-Shot ICL Benchmark (ICL-50), a large-scale benchmark of 50 tasks that cover shot numbers from 1 to 350 within sequences of up to 8,000 tokens, for both fine-tuning and evaluation purposes. Experimental results demonstrate that LLMs enhanced with DrICL achieve significant improvements in many-shot setups across various tasks, including both in-domain and out-of-domain scenarios. We release the code and dataset to facilitate further research in many-shot ICL: https://github.com/xiaoqzhwhu/DrICL
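The abstract names two ideas without giving their formulas: a global objective that pushes many-shot loss below zero-shot loss, and a local reweighting of demonstrations based on cumulative advantages. The toy sketch below illustrates one way such objectives could look. The function names, the hinge penalty, the running-mean baseline, and the exponential weighting are all assumptions made for illustration; they are not taken from the DrICL paper or its code.

```python
import math


def differentiated_loss(zero_shot_nll, many_shot_nll, margin=0.0):
    """Hypothetical global objective: many-shot NLL plus a hinge penalty
    that is active only when many-shot NLL fails to beat zero-shot NLL."""
    penalty = max(0.0, many_shot_nll - zero_shot_nll + margin)
    return many_shot_nll + penalty


def advantage_weights(shot_nlls):
    """Hypothetical local reweighting: compare each shot's NLL with the
    running mean of all preceding shots and down-weight shots that are
    much worse than that baseline (a crude proxy for noisy data)."""
    weights = []
    for i, nll in enumerate(shot_nlls):
        previous = shot_nlls[:i]
        # The first shot has no history, so its advantage is zero.
        baseline = sum(previous) / len(previous) if previous else nll
        advantage = baseline - nll  # negative when the shot is harder than usual
        weights.append(math.exp(advantage))
    total = sum(weights)
    # Rescale so that the weights average to 1.
    return [w * len(weights) / total for w in weights]


def reweighted_nll(shot_nlls):
    """Weighted mean of the per-shot NLLs using the weights above."""
    weights = advantage_weights(shot_nlls)
    return sum(w * nll for w, nll in zip(weights, shot_nlls)) / len(shot_nlls)


if __name__ == "__main__":
    per_shot = [1.0, 1.0, 3.0]  # made-up losses; the third shot plays the noisy one
    print(advantage_weights(per_shot))
    print(reweighted_nll(per_shot))
    print(differentiated_loss(zero_shot_nll=1.0, many_shot_nll=1.5))
```

In this sketch the outlier shot receives a weight below 1, so it contributes less to the training signal than it would under a plain mean. How DrICL actually defines its advantages and weights should be checked against the released code.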
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Question Answering | OpenBookQA | Accuracy | 80 | 465 |
| Question Answering | ARC | Accuracy | 81 | 154 |
| Clustering | CLSClusteringS2S | Accuracy | 89 | 68 |
| Sentiment Extraction | TweetSentimentExtraction | Accuracy | 0.83 | 60 |
| Text Clustering | CLSClusteringS2S id (test) | Accuracy | 88 | 44 |
| Text Clustering | ArxivClusteringS2S ood (test) | Accuracy | 42 | 44 |
| Mathematical Reasoning | GSM8K (test) | Accuracy | 32 | 24 |
| Retrieval | EcomRetrieval in-domain (test) | Accuracy | 94 | 16 |
| Summarization | XSUM in-domain (test) | D3 Score | 20 | 16 |
| Retrieval | VideoRetrieval out-of-domain (test) | Accuracy | 100 | 16 |