Code Generation on CodeEval-Pro BigCodeBench-Lite-Pro and HumanEval-Pro (Held-out)

79.2 Average Accuracy

Self-invoking

Updated 1mo ago

Evaluation Results

Method	Links	Average Accuracy
Self-invoking 2026.06		79.2
Self-invoking (w/ subtask solution) 2026.06		73.3
ReAct 2026.06		72.5
ReAct 2026.06		72.5
MEMPROBE 2026.06		71.7
LangMem 2026.06		70.8
Mem0 2026.06		70.8
MEMPROBE 2026.06		70.8
AWM 2026.06		70
AWM 2026.06		70
ReMem 2026.06		70
Mem0 2026.06		69.2
ExpRAG 2026.06		69.2
ExpRAG 2026.06		68.3
ReMem 2026.06		67.5
LangMem 2026.06		66.7
DC-RS 2026.06		65.8
DC-RS 2026.06		60.8