
Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory

About

Despite their impressive performance on complex tasks, current language models (LMs) typically operate in a vacuum: each input query is processed separately, without retaining insights from previous attempts. Here, we present Dynamic Cheatsheet (DC), a lightweight framework that endows a black-box LM with a persistent, evolving memory. Rather than repeatedly re-discovering or re-committing the same solutions and mistakes, DC enables models to store and reuse accumulated strategies, code snippets, and general problem-solving insights at inference time. This test-time learning enhances performance substantially across a range of tasks without needing explicit ground-truth labels or human feedback. With DC, Claude 3.5 Sonnet's accuracy more than doubled on AIME math exams once it began retaining algebraic insights across questions. Similarly, GPT-4o's success rate on Game of 24 increased from 10% to 99% after the model discovered and reused a Python-based solution. In tasks prone to arithmetic mistakes, such as balancing equations, DC enabled GPT-4o and Claude to reach near-perfect accuracy by recalling previously validated code, whereas their baselines stagnated around 50%. Beyond arithmetic challenges, DC yields notable accuracy gains on knowledge-demanding tasks: Claude achieved a 9% improvement on GPQA-Diamond and an 8% boost on MMLU-Pro problems. Crucially, DC's memory is self-curated, focusing on concise, transferable snippets rather than entire transcripts. Unlike finetuning or static retrieval methods, DC adapts LMs' problem-solving skills on the fly, without modifying their underlying parameters. Overall, our findings present DC as a promising approach for augmenting LMs with persistent memory, bridging the divide between isolated inference events and the cumulative, experience-driven learning characteristic of human cognition.
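The test-time loop the abstract describes — condition on the current memory, answer, then self-curate the memory — can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: `call_lm` stands in for any black-box LM API, and the prompt wording is an assumption.

```python
def solve_with_cheatsheet(call_lm, queries, cheatsheet=""):
    """Dynamic Cheatsheet-style test-time learning loop (illustrative sketch).

    call_lm: a black-box LM interface, prompt string -> response string.
    queries: the stream of test inputs, processed in order.
    cheatsheet: the persistent memory, updated after every query.
    """
    answers = []
    for query in queries:
        # 1. Condition the model on the evolving memory plus the new query.
        answer = call_lm(
            f"Cheatsheet of reusable strategies:\n{cheatsheet}\n\n"
            f"Question: {query}"
        )
        answers.append(answer)
        # 2. Self-curation: the model updates its own memory, keeping only
        #    concise, transferable snippets rather than the full transcript.
        #    No ground-truth label or human feedback is involved.
        cheatsheet = call_lm(
            "Update this cheatsheet with any generally useful strategy or "
            "code snippet from the attempt below. Keep it concise.\n\n"
            f"Current cheatsheet:\n{cheatsheet}\n\n"
            f"Question: {query}\nAttempt: {answer}"
        )
    return answers, cheatsheet
```

Because the memory is a plain string carried across calls, the loop works with any hosted model and never touches its parameters.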
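The Game of 24 result above came from the model discovering and then reusing a Python-based solution from its cheatsheet. A brute-force solver of the kind a model might store could look like this — an illustrative sketch, not the snippet the model actually produced:

```python
from itertools import permutations, product

def solve24(nums):
    """Return an expression over the four numbers that evaluates to 24,
    or None if no combination of +, -, *, / works."""
    ops = ["+", "-", "*", "/"]
    for a, b, c, d in permutations(map(float, nums)):
        for o1, o2, o3 in product(ops, repeat=3):
            # All five parenthesizations of four operands.
            for expr in (
                f"(({a}{o1}{b}){o2}{c}){o3}{d}",
                f"({a}{o1}({b}{o2}{c})){o3}{d}",
                f"({a}{o1}{b}){o2}({c}{o3}{d})",
                f"{a}{o1}(({b}{o2}{c}){o3}{d})",
                f"{a}{o1}({b}{o2}({c}{o3}{d}))",
            ):
                try:
                    if abs(eval(expr) - 24) < 1e-6:
                        return expr
                except ZeroDivisionError:
                    continue
    return None
```

Once a snippet like this is validated and stored, every subsequent Game of 24 query reduces to recalling and running it, which explains the jump from 10% to 99%.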

Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, James Zou • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH (test) | Overall Accuracy | 88 | 433 |
| Mathematical Reasoning | AIME | AIME Accuracy | 50.7 | 283 |
| Graduate-level Question Answering | GPQA | Accuracy | 44 | 114 |
| Question Answering | MMLU-Pro | Accuracy | 70.1 | 56 |
| Financial Question Answering | FinQA (test) | Accuracy | 60.9 | 42 |
| Mathematical Reasoning | CHAMP standard (test) | Accuracy | 40.4 | 36 |
| Science Question Answering | GPQA (test) | Accuracy | 65.2 | 24 |
| Tool Use | Task-Bench | Task Completion Rate | 54 | 14 |
| Trustworthiness evaluation | Trust-Memevo Science Domain | No-Memory Score | 78.4 | 14 |
| Trustworthiness evaluation | Trust-Memevo Math Domain | No-Memory Score | 35.3 | 14 |

Showing 10 of 25 rows.
