Memp: Exploring Agent Procedural Memory
About
Agents built on Large Language Models (LLMs) excel at diverse tasks, yet they suffer from brittle procedural memory that is manually engineered or entangled in static parameters. In this work, we investigate strategies to endow agents with a learnable, updatable, and lifelong procedural memory. We propose Memp, which distills past agent trajectories into both fine-grained, step-by-step instructions and higher-level, script-like abstractions, and we explore the impact of different strategies for the Build, Retrieval, and Update of procedural memory. Coupled with a dynamic regimen that continuously updates, corrects, and deprecates its contents, the memory repository evolves in lockstep with new experience. Empirical evaluation on TravelPlanner and ALFWorld shows that as the memory repository is refined, agents achieve steadily higher success rates and greater efficiency on analogous tasks. Moreover, procedural memory built from a stronger model retains its value: migrating it to a weaker model can still yield substantial performance gains. Code is available at https://github.com/zjunlp/MemP.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multi-hop Question Answering | HotpotQA | -- | 221 |
| Science Question Answering | GPQA | pass@1 Accuracy: 74.75 | 85 |
| Interactive Decision-making | ALFWorld | PICK: 54.3 | 52 |
| Mathematical Problem Solving | AIME | AIME Score: 53.33 | 35 |
| Embodied Decision-making | ALFWorld | Success Rate: 77.61 | 31 |
| Interactive Web-based Shopping | WebShop | Score: 25.3 | 28 |
| General Problem Solving | Mixed (AIME, GPQA, HLE, HotpotQA, ALFWorld) | Average Score: 54.71 | 24 |
| Humanities Question Answering | HLE | HLE Score: 11.22 | 24 |
| Embodied Agent | ALFWorld Seen | Average Reward: 85.7 | 12 |
| Embodied Agent | ALFWorld Unseen | Average Reward: 77.2 | 12 |