Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning
About
In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning. However, existing literature has highlighted the sensitivity of this capability to the selection of few-shot demonstrations. Current understandings of the underlying mechanisms by which this capability arises from regular language model pretraining objectives remain disconnected from the real-world LLMs. This study aims to examine the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as latent variable models. On this premise, we propose an algorithm to select optimal demonstrations from a set of annotated data with a small LM, and then directly generalize the selected demonstrations to larger LMs. We demonstrate significant improvement over baselines, averaged over eight GPT models on eight real-world text classification datasets. We also demonstrate the real-world usefulness of our algorithm on GSM8K, a math word problem dataset. Our empirical findings support our hypothesis that LLMs implicitly infer a latent variable containing task information.
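The selection step described above can be illustrated with a minimal sketch. It assumes that each annotated candidate can be given a scalar score by a small LM, for example a measure of how strongly that demonstration points to the intended task. The `score` callable, the `Demo` type, and the `Input:`/`Label:` prompt template are hypothetical placeholders for illustration and are not taken from the paper. The chosen demonstrations are then reused unchanged in a prompt that could be sent to a larger LM.

```python
from typing import Callable, List, Sequence, Tuple

# A demonstration is an (input text, label) pair. Hypothetical type for this sketch.
Demo = Tuple[str, str]


def select_demonstrations(
    candidates: Sequence[Demo],
    score: Callable[[Demo], float],
    k: int,
) -> List[Demo]:
    """Return the k highest-scoring candidates, best first.

    `score` is an assumed stand-in for whatever quantity a small LM
    assigns to a demonstration; it is not the paper's actual criterion.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[:k]


def build_prompt(demos: Sequence[Demo], query: str) -> str:
    """Format the selected demonstrations plus a query as a few-shot prompt.

    The template is an assumption; the same string can be passed to any LM.
    """
    blocks = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    pool = [("great movie", "positive"), ("dull", "negative"), ("loved it", "positive")]
    # Toy score (input length) used only so the example runs without a model.
    chosen = select_demonstrations(pool, lambda d: float(len(d[0])), k=2)
    print(build_prompt(chosen, "not bad at all"))
```

Keeping the scoring function separate from the prompt builder mirrors the setup in the abstract: scoring happens once with a small model, and the resulting demonstrations transfer to larger models without re-scoring.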
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 86.72 | 1891 |
| Natural Language Inference | RTE | Accuracy | 56.3 | 448 |
| Reading Comprehension | BoolQ | Accuracy | 77.6 | 279 |
| Topic Classification | AG-News | Accuracy | 67.3 | 225 |
| Common Sense Reasoning | COPA | Accuracy | 83 | 197 |
| Natural Language Inference | SNLI | Accuracy | 43.5 | 180 |
| Sentiment Analysis | SST-2 | Accuracy | 88.8 | 165 |
| Text Classification | SST-2 | Accuracy | 90.3 | 125 |
| Text Classification | AGNews | Accuracy | 61 | 119 |
| Topic Classification | DBpedia | Accuracy | 75.5 | 117 |