
APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay

About

LLM-based autonomous agents lack persistent procedural memory: they re-derive solutions from scratch even when structurally identical tasks have been solved before. We present APEX-EM, a non-parametric online learning framework that accumulates, retrieves, and reuses structured procedural plans without modifying model weights. APEX-EM introduces: (1) a structured experience representation encoding the full procedural-episodic trace of each execution -- planning steps, artifacts, iteration history with error analysis, and quality scores; (2) a Plan-Retrieve-Generate-Iterate-Ingest (PRGII) workflow with Task Verifiers providing multi-dimensional reward signals; and (3) a dual-outcome Experience Memory with hybrid retrieval combining semantic search, structural signature matching, and plan DAG traversal -- enabling cross-domain transfer between tasks sharing no lexical overlap but analogous operational structure. Successful experiences serve as positive in-context examples; failures as negative examples with structured error annotations. We evaluate on BigCodeBench, KGQAGen-10k, and Humanity's Last Exam using Claude Sonnet 4.5 and Opus 4.5. On KGQAGen-10k, APEX-EM achieves 89.6% accuracy versus 41.3% without memory (+48.3pp), surpassing the oracle-retrieval upper bound (84.9%). On BigCodeBench, it reaches 83.3% SR from a 53.9% baseline (+29.4pp), exceeding MemRL's +11.0pp gain under comparable frozen-backbone conditions (noting backbone differences controlled for in our analysis). On HLE, entity graph retrieval reaches 48.0% from 25.2% (+22.8pp). Ablations show component value is task-dependent: rich judge feedback is negligible for code generation but critical for structured queries (+10.3pp), while binary-signal iteration partially compensates for weaker feedback.
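
The Plan-Retrieve-Generate-Iterate-Ingest loop and the dual-outcome memory described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Experience`, `ExperienceMemory`, `run_task`, `plan_fn`, `generate_fn`, and `verify_fn` are hypothetical, retrieval is reduced to Jaccard overlap on an assumed structural signature (the abstract describes a hybrid of semantic search, signature matching, and plan DAG traversal), and the verifier is assumed to return a single score with a feedback string rather than a multi-dimensional reward.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Experience:
    """One stored execution trace (field names are illustrative assumptions)."""
    task: str
    signature: Tuple[str, ...]  # assumed structural signature, e.g. operation types
    plan: List[str]
    artifact: str
    score: float
    success: bool
    iterations: int
    error_note: str = ""        # annotation kept only for failed runs


@dataclass
class ExperienceMemory:
    """Dual-outcome store: successes and failures are both kept."""
    entries: List[Experience] = field(default_factory=list)

    def ingest(self, exp: Experience) -> None:
        self.entries.append(exp)

    def retrieve(self, signature: Tuple[str, ...], k: int = 3):
        """Return (positives, negatives) among the top-k signature matches."""
        query = set(signature)

        def overlap(exp: Experience) -> float:
            other = set(exp.signature)
            union = query | other
            return len(query & other) / len(union) if union else 0.0

        ranked = sorted(self.entries, key=overlap, reverse=True)
        hits = [e for e in ranked if overlap(e) > 0][:k]
        positives = [e for e in hits if e.success]
        negatives = [e for e in hits if not e.success]
        return positives, negatives


def run_task(task: str, plan_fn: Callable, generate_fn: Callable,
             verify_fn: Callable, memory: ExperienceMemory,
             max_iters: int = 3) -> Experience:
    signature, plan = plan_fn(task)                      # Plan
    positives, negatives = memory.retrieve(signature)    # Retrieve
    artifact, score, feedback, iterations = "", 0.0, "", 0
    for iterations in range(1, max_iters + 1):           # Iterate
        artifact = generate_fn(task, plan, positives, negatives, feedback)  # Generate
        score, feedback = verify_fn(task, artifact)      # verifier signal
        if score >= 1.0:
            break
    success = score >= 1.0
    exp = Experience(task, signature, plan, artifact, score, success,
                     iterations, "" if success else feedback)
    memory.ingest(exp)                                   # Ingest (model weights untouched)
    return exp
```

Because retrieval keys on the signature rather than on the task text, two tasks with no shared vocabulary but the same operation sequence would retrieve each other's experiences, which is the kind of cross-domain reuse the abstract describes. In a real system `generate_fn` would be an LLM call that receives the positives and negatives as in-context examples.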

Pratyay Banerjee, Masud Moshtaghi, Ankit Chadha • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Structured query generation | KGQAGen-10k 2,000-question sample (train) | LASM Accuracy: 89.6 | 20 |
| Code Generation | BigCodeBench (BCB) 342 tasks 30% held-out (unseen) | Success Rate (SR): 55.8 | 15 |
| Knowledge Graph Question Answering | KGQAGen-10k 1,079 unseen questions (test) | LASM Accuracy: 73.7 | 15 |
| Multi-domain knowledge reasoning | HLE 500-question ablation | Success Rate (Last): 48 | 12 |
| Code Generation | BigCodeBench Instruct Full (train) | Last SR: 83.3 | 10 |
| Database task execution | Lifelong-DB (held-out) | -- | 6 |
| Embodied AI task execution | ALFWorld (held-out) | -- | 6 |
| Operating System task execution | Lifelong-OS (held-out) | -- | 6 |