Code Generation on BigCodeBench (30% held-out tasks)
Top result: APEX-EM (A5: Opus judge), 55.8 Success Rate (SR), Mar 31, 2026
Updated 17d ago
Evaluation Results
| Method | Details | Date | Success Rate (SR) | Delta Score |
|---|---|---|---|---|
| APEX-EM (A5: Opus judge) | Type=All + Opus judge,... | 2026.03 | 55.8 | 9.9 |
| APEX-EM (A5: GPT 4o as base) | Type=All, Base Model=G... | 2026.03 | 53.5 | 9.9 |
| APEX-EM (EG2: Full memory) | Type=Rich feedback, it... | 2026.03 | 51.5 | 5.6 |
| APEX-EM (EG1: Entity graph) | Type=Rich feedback, it... | 2026.03 | 51.2 | 5.3 |
| MemRL | Type=Q-value episodic... | 2026.03 | 50.8 | 2.3 |
| APEX-EM (A1: Memory, no judge) | Type=Memory, binary si... | 2026.03 | 50.6 | 4.7 |
| APEX-EM (R1: Semantic only) | Type=Rich feedback, it... | 2026.03 | 50.3 | 4.4 |
| Self-RAG | Type=Critique-filtered... | 2026.03 | 50 | 1.5 |
| MemP | Type=Procedural distil... | 2026.03 | 49.4 | 0.9 |
| APEX-EM (A3: Judge + iteration) | Type=Memory, binary, i... | 2026.03 | 48.8 | 2.9 |
| No Memory | Type=Single-shot LLM,... | 2026.03 | 48.5 | - |
| Mem0 | Type=Entity-centric me... | 2026.03 | 48.5 | 0 |
| RAG | Type=Semantic retrieva... | 2026.03 | 47.9 | 0.6 |
| APEX-EM (A2: Memory + judge) | Type=Memory, rich feed... | 2026.03 | 46.5 | 0.6 |
| APEX-EM (A0: No Memory) | Type=Sonnet 4.5, singl... | 2026.03 | 45.9 | - |
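The page does not define Delta Score. One reading that fits most rows is that it is the Success Rate gain over the matching no-memory baseline: APEX-EM variants against APEX-EM (A0: No Memory), and the other methods against No Memory. The sketch below checks the listed values under that assumption. The baseline pairing, the helper names, and the tolerance are all hypothetical choices made for illustration, not something the leaderboard states.

```python
# Success Rate (SR) and Delta Score as listed on the leaderboard.
# None marks the rows whose Delta Score is shown as "-".
ROWS = {
    "APEX-EM (A5: Opus judge)": (55.8, 9.9),
    "APEX-EM (A5: GPT 4o as base)": (53.5, 9.9),
    "APEX-EM (EG2: Full memory)": (51.5, 5.6),
    "APEX-EM (EG1: Entity graph)": (51.2, 5.3),
    "MemRL": (50.8, 2.3),
    "APEX-EM (A1: Memory, no judge)": (50.6, 4.7),
    "APEX-EM (R1: Semantic only)": (50.3, 4.4),
    "Self-RAG": (50.0, 1.5),
    "MemP": (49.4, 0.9),
    "APEX-EM (A3: Judge + iteration)": (48.8, 2.9),
    "No Memory": (48.5, None),
    "Mem0": (48.5, 0.0),
    "RAG": (47.9, 0.6),
    "APEX-EM (A2: Memory + judge)": (46.5, 0.6),
    "APEX-EM (A0: No Memory)": (45.9, None),
}


def baseline_for(method: str) -> str:
    """Assumed pairing: APEX-EM rows use A0, all other rows use No Memory."""
    if method.startswith("APEX-EM"):
        return "APEX-EM (A0: No Memory)"
    return "No Memory"


def inconsistent_rows(rows: dict, tol: float = 0.05) -> list:
    """Return methods whose listed delta differs from SR minus baseline SR."""
    mismatches = []
    for method, (sr, delta) in rows.items():
        if delta is None:  # baseline rows carry no Delta Score
            continue
        baseline_sr = rows[baseline_for(method)][0]
        if abs((sr - baseline_sr) - delta) > tol:
            mismatches.append(method)
    return mismatches


if __name__ == "__main__":
    for name in inconsistent_rows(ROWS):
        print(name)
```

Under this assumed definition, every row matches except two: APEX-EM (A5: GPT 4o as base), which may be measured against a GPT 4o baseline that is not listed here, and RAG, whose SR is 0.6 below No Memory while its Delta Score is listed as 0.6.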