Autonomous Exploration on TextCraft
[Chart: leaderboard metric over time, with selectable metrics Steps, Error Correction Count (ECC), and Task Deviation (ΔTask). Labeled point: Qwen2.5-7B+GRPO, 8.7 Steps, May 15, 2026.]
Updated 16d ago
Evaluation Results
| Method | Model Category | Date | Steps | Error Correction Count (ECC) | Task Deviation (ΔTask) |
|---|---|---|---|---|---|
| Qwen2.5-7B+GRPO | Open-So... | 2026.05 | 8.7 | 11.3 | 2.1 |
| Qwen3-4B+GRPO | Open-So... | 2026.05 | 14.5 | 10.8 | 0.2 |
| Qwen3-4B | Open-So... | 2026.05 | 21.9 | 20.6 | 3.4 |
| GPT-4.1 | Closed-... | 2026.05 | 31.4 | 57.6 | 4.3 |
| Qwen2.5-7B | Open-So... | 2026.05 | 50.8 | 15.2 | 1.1 |
| LLaMA3.1-8B | Open-So... | 2026.05 | 65.9 | 22.1 | 1.5 |
| Claude-Opus-4.5 | Closed-... | 2026.05 | 97.3 | 82.5 | 7.8 |