Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory
About
Despite their impressive performance on complex tasks, current language models (LMs) typically operate in a vacuum: each input query is processed separately, without retaining insights from previous attempts. Here, we present Dynamic Cheatsheet (DC), a lightweight framework that endows a black-box LM with a persistent, evolving memory. Rather than repeatedly re-discovering or re-committing the same solutions and mistakes, DC enables models to store and reuse accumulated strategies, code snippets, and general problem-solving insights at inference time. This test-time learning substantially enhances performance across a range of tasks without requiring explicit ground-truth labels or human feedback.

With DC, Claude 3.5 Sonnet more than doubled its accuracy on AIME math exams once it began retaining algebraic insights across questions. Similarly, GPT-4o's success rate on Game of 24 increased from 10% to 99% after the model discovered and reused a Python-based solution. In tasks prone to arithmetic mistakes, such as balancing equations, DC enabled GPT-4o and Claude to reach near-perfect accuracy by recalling previously validated code, whereas their baselines stagnated around 50%. Beyond arithmetic challenges, DC yields notable accuracy gains on knowledge-demanding tasks: Claude achieved a 9% improvement on GPQA-Diamond and an 8% boost on MMLU-Pro problems.

Crucially, DC's memory is self-curated, focusing on concise, transferable snippets rather than entire transcripts. Unlike finetuning or static retrieval methods, DC adapts LMs' problem-solving skills on the fly, without modifying their underlying parameters. Overall, our findings present DC as a promising approach for augmenting LMs with persistent memory, bridging the divide between isolated inference events and the cumulative, experience-driven learning characteristic of human cognition.
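The core loop described above, answering each query with the current cheatsheet in context and then curating the memory with any new insight, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `call_lm` is a hypothetical stand-in for any black-box LM API, and the `ANSWER:`/`INSIGHT:` response format is an assumed convention for this sketch.

```python
def call_lm(prompt: str) -> str:
    """Hypothetical black-box LM call; replace with a real API client.

    Returns a canned response here so the sketch is runnable.
    """
    return (
        "ANSWER: 24\n"
        "INSIGHT: For Game of 24, brute-force all operator and "
        "ordering combinations in code rather than by hand."
    )


def solve_with_cheatsheet(query: str, memory: list[str]) -> tuple[str, list[str]]:
    """Answer `query` using the accumulated cheatsheet, then curate it."""
    cheatsheet = "\n".join(f"- {m}" for m in memory) if memory else "(empty)"
    prompt = (
        f"Cheatsheet of prior strategies:\n{cheatsheet}\n\n"
        f"Problem: {query}\n"
        "Answer the problem, then state one reusable INSIGHT."
    )
    response = call_lm(prompt)

    answer, insight = "", ""
    for line in response.splitlines():
        if line.startswith("ANSWER:"):
            answer = line[len("ANSWER:"):].strip()
        elif line.startswith("INSIGHT:"):
            insight = line[len("INSIGHT:"):].strip()

    # Self-curation: store concise, novel snippets, not full transcripts.
    if insight and insight not in memory:
        memory = memory + [insight]
    return answer, memory


memory: list[str] = []
answer, memory = solve_with_cheatsheet("Make 24 from 4, 7, 8, 8.", memory)
```

The key design point is that learning happens purely at inference time: the model's parameters are untouched, and all adaptation lives in the growing `memory` list that is prepended to each new prompt.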
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH (test) | Overall Accuracy | 88 | 433 |
| Mathematical Reasoning | AIME | AIME Accuracy | 50.7 | 283 |
| Graduate-level Question Answering | GPQA | Accuracy | 44 | 114 |
| Question Answering | MMLU-Pro | Accuracy | 70.1 | 56 |
| Financial Question Answering | FinQA (test) | Accuracy | 60.9 | 42 |
| Mathematical Reasoning | CHAMP standard (test) | Accuracy | 40.4 | 36 |
| Science Question Answering | GPQA (test) | Accuracy | 65.2 | 24 |
| Tool Use | Task-Bench | Task Completion Rate | 54 | 14 |
| Trustworthiness Evaluation | Trust-Memevo Science Domain | No-Memory Score | 78.4 | 14 |
| Trustworthiness Evaluation | Trust-Memevo Math Domain | No-Memory Score | 35.3 | 14 |