Code Generation on BigCodeBench (30% held-out tasks)

55.8 Success Rate (SR)

APEX-EM (A5: Opus judge)

Updated 3mo ago

Evaluation Results

Method	Links	Success Rate (SR)
APEX-EM (A5: Opus judge) 2026.03		55.8	9.9
APEX-EM (A5: GPT 4o as base) 2026.03		53.5	9.9
APEX-EM (EG2: Full memory) 2026.03		51.5	5.6
APEX-EM (EG1: Entity graph) 2026.03		51.2	5.3
MemRL 2026.03		50.8	2.3
APEX-EM (A1: Memory, no judge) 2026.03		50.6	4.7
APEX-EM (R1: Semantic only) 2026.03		50.3	4.4
Self-RAG 2026.03		50.0	1.5
MemP 2026.03		49.4	0.9
APEX-EM (A3: Judge + iteration) 2026.03		48.8	2.9
No Memory 2026.03		48.5	-
Mem0 2026.03		48.5	0.0
RAG 2026.03		47.9	0.6
APEX-EM (A2: Memory + judge) 2026.03		46.5	0.6
APEX-EM (A0: No Memory) 2026.03		45.9	-
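
For readers who want to work with these numbers directly, here is a minimal Python sketch that ranks the methods by Success Rate and computes the SR difference against a chosen baseline row. The SR values are copied from the table above. The helper names `rank_by_sr` and `gap_to` are hypothetical and not part of any APEX-EM or BigCodeBench tooling, and the choice of baseline row is the reader's assumption, not something the leaderboard states.

```python
# Success Rate (SR) values copied from the results table above.
ROWS = [
    ("APEX-EM (A5: Opus judge)", 55.8),
    ("APEX-EM (A5: GPT 4o as base)", 53.5),
    ("APEX-EM (EG2: Full memory)", 51.5),
    ("APEX-EM (EG1: Entity graph)", 51.2),
    ("MemRL", 50.8),
    ("APEX-EM (A1: Memory, no judge)", 50.6),
    ("APEX-EM (R1: Semantic only)", 50.3),
    ("Self-RAG", 50.0),
    ("MemP", 49.4),
    ("APEX-EM (A3: Judge + iteration)", 48.8),
    ("No Memory", 48.5),
    ("Mem0", 48.5),
    ("RAG", 47.9),
    ("APEX-EM (A2: Memory + judge)", 46.5),
    ("APEX-EM (A0: No Memory)", 45.9),
]


def rank_by_sr(rows):
    """Return (method, SR) pairs sorted by SR, highest first.

    Python's sort is stable, so tied rows keep their original order.
    """
    return sorted(rows, key=lambda row: row[1], reverse=True)


def gap_to(rows, baseline_name):
    """Return each method's SR minus the SR of the named baseline row.

    Differences are rounded to one decimal place, matching the table.
    """
    baseline_sr = dict(rows)[baseline_name]
    return {name: round(sr - baseline_sr, 1) for name, sr in rows}


if __name__ == "__main__":
    for name, sr in rank_by_sr(ROWS):
        print(f"{sr:5.1f}  {name}")

    # Baseline choice is an assumption made for illustration.
    gaps = gap_to(ROWS, "APEX-EM (A0: No Memory)")
    print(gaps["APEX-EM (A5: Opus judge)"])
```

Passing a different row name to `gap_to`, such as `"No Memory"`, gives the differences relative to that row instead.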