Autonomous Exploration on SciWorld
[Chart: Steps for each method, plotted at May 15, 2026; the chart metric can be switched between Steps, ECC, and ΔTask.]
Evaluation Results
Method            Model category  Date     Steps   ECC   ΔTask
Qwen2.5-7B+GRPO   Open-So...      2026.05    7.4   15.4    0.3
Qwen3-4B+GRPO     Open-So...      2026.05   43.4   12.9    1.7
GPT-4.1           Closed-...      2026.05   50.8   38.7    0.2
Qwen2.5-7B        Open-So...      2026.05   63.4   32.1    0.6
Qwen3-4B          Open-So...      2026.05   87.8   29.3    0.9
LLaMA3.1-8B       Open-So...      2026.05   97.5   33.7    2.1
Claude-Opus-4.5   Closed-...      2026.05   97.8   89.3   11.7
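Two base models in the Evaluation Results table also appear with a "+GRPO" suffix (Qwen2.5-7B and Qwen3-4B). The sketch below is one way to compare those pairs programmatically. It transcribes the table values and computes, for each metric, the GRPO value minus the base value. It is an illustration only. The names `Row`, `ROWS`, and `grpo_deltas` are hypothetical, and because this page does not define Steps, ECC, or ΔTask, the code makes no assumption about whether higher or lower values are better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One leaderboard entry (hypothetical container; values copied from the table)."""
    method: str
    steps: float
    ecc: float
    delta_task: float


# Transcribed from the Evaluation Results table (all entries dated 2026.05).
ROWS = [
    Row("Qwen2.5-7B+GRPO", 7.4, 15.4, 0.3),
    Row("Qwen3-4B+GRPO", 43.4, 12.9, 1.7),
    Row("GPT-4.1", 50.8, 38.7, 0.2),
    Row("Qwen2.5-7B", 63.4, 32.1, 0.6),
    Row("Qwen3-4B", 87.8, 29.3, 0.9),
    Row("LLaMA3.1-8B", 97.5, 33.7, 2.1),
    Row("Claude-Opus-4.5", 97.8, 89.3, 11.7),
]

SUFFIX = "+GRPO"


def grpo_deltas(rows):
    """For each base model that also appears with a '+GRPO' suffix, return
    (GRPO value - base value) per metric, rounded to one decimal place."""
    by_name = {r.method: r for r in rows}
    out = {}
    for name, tuned in by_name.items():
        if not name.endswith(SUFFIX):
            continue
        base = by_name.get(name[: -len(SUFFIX)])
        if base is None:  # no untuned counterpart in the table
            continue
        out[base.method] = {
            "steps": round(tuned.steps - base.steps, 1),
            "ecc": round(tuned.ecc - base.ecc, 1),
            "delta_task": round(tuned.delta_task - base.delta_task, 1),
        }
    return out


if __name__ == "__main__":
    for base_name, diffs in grpo_deltas(ROWS).items():
        print(base_name, diffs)
```

Rounding to one decimal matches the precision of the published values and hides floating-point noise such as `1.7 - 0.9` evaluating to `0.7999999999999999`.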