Multi-domain knowledge reasoning on HLE 500-question ablation
Top result: MemRL, 57.3 Success Rate (Last), Mar 31, 2026.
Chart metrics: Success Rate (Last), CSR Score.
Updated 17d ago
Evaluation Results
| Method | Backbone | Date | Success Rate (Last) | CSR Score |
|---|---|---|---|---|
| MemRL | Gemini-3-pro | 2026.03 | 57.3 | 61.3 |
| MemP | Gemini-3-pro | 2026.03 | 52.8 | 58.2 |
| Mem0 | Gemini-3-pro | 2026.03 | 51.2 | 56 |
| RAG | Gemini-3-pro | 2026.03 | 50 | 54.8 |
| Self-RAG | Gemini-3-pro | 2026.03 | 48.8 | 54.8 |
| APEX-EM: Entity graph (A3, E10) | Claude Opus 4.5 | 2026.03 | 48 | 53.3 |
| APEX-EM: Full memory (A2, E10) | Claude Opus 4.5 | 2026.03 | 46.8 | 52.3 |
| APEX-EM: Semantic only (A1, E10) | Claude Opus 4.5 | 2026.03 | 45.6 | 52.3 |
| APEX-EM: Judge + iteration (A5, E10) | Claude Opus 4.5 | 2026.03 | 40.4 | 52.9 |
| No Memory | Gemini-3-pro | 2026.03 | 35.7 | - |
| APEX-EM: No Memory (A0) | Claude Opus 4.5 | 2026.03 | 25.2 | - |
| APEX-EM: Memory, no judge (A4, E10) | Claude Opus 4.5 | 2026.03 | 19.4 | 47.9 |
| Pass@10 | Gemini-3-pro | 2026.03 | - | 52.4 |
| Reflexion | Gemini-3-pro | 2026.03 | - | 53 |