Code Generation on BigCodeBench Instruct Full (train)
Top result: APEX-EM + Opus Judge (A5, E10), Last SR 83.3 (Mar 31, 2026)
Evaluation Results
| Method | Details | Date | Last SR | CSR |
|---|---|---|---|---|
| APEX-EM + Opus Judge (A5, E10) | Backbone=Sonnet 4.5, E... | 2026.03 | 83.3 | 84 |
| APEX-EM (EG2, E10) | Backbone=Sonnet 4.5, E... | 2026.03 | 81.1 | 82.7 |
| APEX-EM (E10) | Backbone=GPT-4o, Epoch... | 2026.03 | 81.1 | 81.5 |
| MemRL | Backbone=GPT-4o | 2026.03 | 59.5 | 62.7 |
| MemP | Backbone=GPT-4o | 2026.03 | 57.8 | 60.2 |
| No Memory (A0) | Backbone=Sonnet 4.5, E... | 2026.03 | 53.9 | - |
| Mem0 | Backbone=GPT-4o | 2026.03 | 53 | 57.7 |
| Self-RAG | Backbone=GPT-4o | 2026.03 | 50 | 55.8 |
| No Memory | Backbone=GPT-4o | 2026.03 | 48.5 | - |
| RAG | Backbone=GPT-4o | 2026.03 | 47.9 | 53 |
| Pass@10 | Backbone=GPT-4o | 2026.03 | - | 57.7 |
| Reflexion | Backbone=GPT-4o | 2026.03 | - | 58.2 |