
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

About

With the growing adoption of large language model agents in persistent real-world roles, they naturally encounter continuous streams of tasks. A key limitation, however, is their failure to learn from accumulated interaction history, forcing them to discard valuable insights and repeat past errors. We propose ReasoningBank, a novel memory framework that distills generalizable reasoning strategies from an agent's self-judged successful and failed experiences. At test time, an agent retrieves relevant memories from ReasoningBank to inform its interaction and then integrates new learnings back, enabling it to become more capable over time. Building on this powerful experience learner, we further introduce memory-aware test-time scaling (MaTTS), which accelerates and diversifies this learning process by scaling up the agent's interaction experience. By allocating more compute to each task, the agent generates abundant, diverse experiences that provide rich contrastive signals for synthesizing higher-quality memory. Better memory, in turn, guides more effective scaling, establishing a powerful synergy between memory and test-time scaling. Across web browsing and software engineering benchmarks, ReasoningBank consistently outperforms existing memory mechanisms that store raw trajectories or only successful task routines, improving both effectiveness and efficiency; MaTTS further amplifies these gains. These findings establish memory-driven experience scaling as a new scaling dimension, enabling agents to self-evolve, with emergent behaviors arising naturally.
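The test-time loop described above (retrieve relevant memories, act on the task, then integrate new learnings back) can be sketched minimally as follows. This is an illustrative sketch, not the paper's implementation: the class and method names are hypothetical, retrieval here uses simple word overlap in place of embedding search, and `integrate` stores a labeled summary where the paper uses an LLM to judge outcomes and distill strategies.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    """One distilled memory: a short title, a description, and the strategy itself."""
    title: str
    description: str
    content: str


class ReasoningBank:
    """Hypothetical minimal memory store for the retrieve/integrate cycle."""

    def __init__(self):
        self.items: list[MemoryItem] = []

    def retrieve(self, task: str, k: int = 2) -> list[MemoryItem]:
        # Stand-in for embedding-based retrieval: rank stored items by
        # word overlap between the task and each item's title/description.
        task_words = set(task.lower().split())
        scored = [
            (len(task_words & set((m.title + " " + m.description).lower().split())), m)
            for m in self.items
        ]
        scored.sort(key=lambda pair: -pair[0])
        return [m for score, m in scored[:k] if score > 0]

    def integrate(self, task: str, trajectory: str, success: bool) -> None:
        # The paper distills generalizable strategies from both successes and
        # failures via an LLM; here we just record a labeled placeholder.
        label = "strategy" if success else "pitfall"
        self.items.append(MemoryItem(title=f"{label}: {task}",
                                     description=task,
                                     content=trajectory))


def run_task(bank: ReasoningBank, task: str, act_fn) -> bool:
    """One closed-loop step: retrieve memories, act, write learnings back."""
    memories = bank.retrieve(task)                  # inform the interaction
    trajectory, success = act_fn(task, memories)    # agent attempts the task
    bank.integrate(task, trajectory, success)       # learn from the attempt
    return success
```

MaTTS would fit on top of this loop by calling `act_fn` several times per task and feeding the contrasting trajectories into a single, richer `integrate` step.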

Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T. Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, Tomas Pfister • 2025

Related benchmarks

Task                            Dataset                     Metric               Result   Rank
Mathematical Reasoning          AIME                        AIME Accuracy        58.7     283
Graduate-level Question         GPQA                        Accuracy             65.9     114
  Answering
Question Answering              MMLU-Pro                    Accuracy             89.1     56
Embodied Task Completion        EB-Habitat                  Avg Success Rate     46.4     32
Clinical Decision-Making        MIMIC Common IV (test)      Diagnoses Error      0.0385   28
Agent Task Completion           τ2-BENCH (test)             Average Task Reward  0.441    27
Agent Task Completion           τ-Bench (test)              Average Task Reward  0.645    27
Agent Task Completion           ToolSandbox (test)          Avg Task Reward      0.632    27
Embodied Instruction Following  EB-ALFRED 1.0 (test)        Success Rate (Avg)   41.6     20
Multi-turn Agent Task           ACEBench multi-turn (test)  Process Accuracy     70.3     15
(Showing 10 of 26 rows.)
