Sequential environment decision making on Crafter BALROG protocol
[Chart: Peak Task Score (%) by date. Top entry: MAGE, 37.9, May 11, 2026. Available metrics: Peak Task Score (%), Number of Achievements Unlocked.]
Updated 22d ago
Evaluation Results
| Method | Details | Date | Peak Task Score (%) | Number of Achievements Unlocked |
|---|---|---|---|---|
| MAGE | Teacher=Opus, n=5 | 2026.05 | 37.9 | - |
| MAGE Opus–Llama (n=5, peak) | Iters=5 | 2026.05 | 37.9 | 4 |
| MAGE | Teacher=Haiku, n=5 | 2026.05 | 34.8 | - |
| MAGE Haiku–Llama (n=5, peak) | Iters=5 | 2026.05 | 34.8 | 4 |
| MAGE | Teacher=Sonnet, n=5 | 2026.05 | 33.5 | - |
| MAGE Sonnet–Llama (n=4, peak) | Iters=5 | 2026.05 | 33.5 | 4 |
| Standalone Llama-3.1-8B | Protocol=BALROG, seeds=3 | 2026.05 | 25.5 | - |
| Standalone Llama-3.1-8B (BALROG ref.) | Iters=1 | 2026.05 | 25.5 | - |
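
As a reading aid, the gap between each MAGE entry and the standalone Llama-3.1-8B reference can be computed directly from the Peak Task Score (%) values listed above. The following is a minimal Python sketch; the names `scores`, `BASELINE`, and `gain_over_baseline` are illustrative assumptions and not part of any benchmark tooling.

```python
# Peak Task Score (%) values copied from the results listed above.
# All names here are hypothetical and used for illustration only.
scores = {
    "MAGE (Teacher=Opus)": 37.9,
    "MAGE (Teacher=Haiku)": 34.8,
    "MAGE (Teacher=Sonnet)": 33.5,
}
BASELINE = 25.5  # Standalone Llama-3.1-8B (BALROG ref.)


def gain_over_baseline(score, baseline=BASELINE):
    """Return the absolute gain in percentage points, rounded to one decimal."""
    return round(score - baseline, 1)


# Map each method to its gain over the standalone baseline.
gains = {name: gain_over_baseline(s) for name, s in scores.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain} points over the standalone baseline")
```

The differences are in percentage points of Peak Task Score, not relative improvements.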