Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory
About
Despite their impressive performance on complex tasks, current language models (LMs) typically operate in a vacuum: each input query is processed separately, without retaining insights from previous attempts. Here, we present Dynamic Cheatsheet (DC), a lightweight framework that endows a black-box LM with a persistent, evolving memory. Rather than repeatedly re-discovering or re-committing the same solutions and mistakes, DC enables models to store and reuse accumulated strategies, code snippets, and general problem-solving insights at inference time. This test-time learning substantially enhances performance across a range of tasks without requiring explicit ground-truth labels or human feedback.

With DC, Claude 3.5 Sonnet more than doubled its accuracy on AIME math exams once it began retaining algebraic insights across questions. Similarly, GPT-4o's success rate on Game of 24 increased from 10% to 99% after the model discovered and reused a Python-based solution. In tasks prone to arithmetic mistakes, such as balancing equations, DC enabled GPT-4o and Claude to reach near-perfect accuracy by recalling previously validated code, whereas their baselines stagnated around 50%. Beyond arithmetic challenges, DC yields notable accuracy gains on knowledge-demanding tasks: Claude achieved a 9% improvement on GPQA-Diamond and an 8% boost on MMLU-Pro problems.

Crucially, DC's memory is self-curated, focusing on concise, transferable snippets rather than entire transcripts. Unlike finetuning or static retrieval methods, DC adapts LMs' problem-solving skills on the fly, without modifying their underlying parameters. Overall, our findings present DC as a promising approach for augmenting LMs with persistent memory, bridging the divide between isolated inference events and the cumulative, experience-driven learning characteristic of human cognition.
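The core loop described above, answering each query with the current cheatsheet in context and then curating the memory with any new insight, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `call_lm` is a hypothetical stand-in for any black-box LM API, and the `ANSWER:`/`INSIGHT:` response format is an assumed convention for this sketch.

```python
def call_lm(prompt: str) -> str:
    """Hypothetical black-box LM call; replace with a real API client.

    Returns a canned response here so the sketch is runnable.
    """
    return (
        "ANSWER: 24\n"
        "INSIGHT: For Game of 24, brute-force all operator and "
        "ordering combinations in code rather than by hand."
    )


def solve_with_cheatsheet(query: str, memory: list[str]) -> tuple[str, list[str]]:
    """Answer `query` using the accumulated cheatsheet, then curate it."""
    cheatsheet = "\n".join(f"- {m}" for m in memory) if memory else "(empty)"
    prompt = (
        f"Cheatsheet of prior strategies:\n{cheatsheet}\n\n"
        f"Problem: {query}\n"
        "Answer the problem, then state one reusable INSIGHT."
    )
    response = call_lm(prompt)

    answer, insight = "", ""
    for line in response.splitlines():
        if line.startswith("ANSWER:"):
            answer = line[len("ANSWER:"):].strip()
        elif line.startswith("INSIGHT:"):
            insight = line[len("INSIGHT:"):].strip()

    # Self-curation: store concise, novel snippets, not full transcripts.
    if insight and insight not in memory:
        memory = memory + [insight]
    return answer, memory


memory: list[str] = []
answer, memory = solve_with_cheatsheet("Make 24 from 4, 7, 8, 8.", memory)
```

The key design point is that learning happens purely at inference time: the model's parameters are untouched, and all adaptation lives in the growing `memory` list that is prepended to each new prompt.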
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH (test) | Overall Accuracy | 88 | 433 |
| Mathematical Reasoning | AIME | AIME Accuracy | 50.7 | 283 |
| Graduate-level Question Answering | GPQA | Accuracy | 44 | 114 |
| Question Answering | MMLU-Pro | Accuracy | 70.1 | 56 |
| Financial Question Answering | FinQA (test) | Accuracy | 60.9 | 42 |
| Mathematical Reasoning | CHAMP standard (test) | Accuracy | 40.4 | 36 |
| Science Question Answering | GPQA (test) | Accuracy | 65.2 | 24 |
| Tool Use | Task-Bench | Task Completion Rate | 54 | 14 |
| Trustworthiness Evaluation | Trust-Memevo Science Domain | No-Memory Score | 78.4 | 14 |
| Trustworthiness Evaluation | Trust-Memevo Math Domain | No-Memory Score | 35.3 | 14 |