Code Generation on CodeEval-Pro BigCodeBench-Lite-Pro and HumanEval-Pro (Held-out)
[Chart: Average Accuracy, dated Jun 1, 2026. Best result: 79.2 (Self-invoking).]
Updated 1d ago
Evaluation Results
| Method | Stream Design | Date | Average Accuracy |
|---|---|---|---|
| Self-invoking | Naive | 2026.06 | 79.2 |
| Self-invoking (w/ subtask solution) | Naive | 2026.06 | 73.3 |
| ReAct | Composit... | 2026.06 | 72.5 |
| ReAct | Naive | 2026.06 | 72.5 |
| MEMPROBE | Composit... | 2026.06 | 71.7 |
| LangMem | Composit... | 2026.06 | 70.8 |
| Mem0 | Composit... | 2026.06 | 70.8 |
| MEMPROBE | Naive | 2026.06 | 70.8 |
| AWM | Composit... | 2026.06 | 70 |
| AWM | Naive | 2026.06 | 70 |
| ReMem | Naive | 2026.06 | 70 |
| Mem0 | Naive | 2026.06 | 69.2 |
| ExpRAG | Naive | 2026.06 | 69.2 |
| ExpRAG | Composit... | 2026.06 | 68.3 |
| ReMem | Composit... | 2026.06 | 67.5 |
| LangMem | Naive | 2026.06 | 66.7 |
| DC-RS | Naive | 2026.06 | 65.8 |
| DC-RS | Composit... | 2026.06 | 60.8 |