Autonomous Exploration on ALFWorld, SciWorld, TextCraft Macro-average
[Chart: Average ECC by method. Highlighted entry: Qwen2.5-7B+GRPO, Average ECC 12.6.]
May 15, 2026
Updated 16d ago
Evaluation Results
Method            Model Category   Date      Average ECC   Delta Task Change
Qwen2.5-7B+GRPO   Open-So...       2026.05   12.6          1.2
Qwen3-4B+GRPO     Open-So...       2026.05   18.8          0.8
Qwen2.5-7B        Open-So...       2026.05   22.2          0.7
Qwen3-4B          Open-So...       2026.05   28.5          2.2
LLaMA3.1-8B       Open-So...       2026.05   30.9          1.7
GPT-4.1           Closed-...       2026.05   49.3          2
Claude-Opus-4.5   Closed-...       2026.05   89.5          8.6