Agent Task on ScienceWorld (Avg Reward, Success Rate)
[Chart: Success Rate and Average Reward over time, Jul 25, 2025 to Apr 10, 2026. Highlighted result: E3-TIR, Success Rate 67.5.]
Updated 6d ago
Evaluation Results
| Method | Details | Date | Success Rate | Average Reward |
| E3-TIR | Backbone=Qwen2.5-3B-In... | 2026.04 | 67.5 | - |
| W2SG with MCTS | | 2025.07 | 66.8 | 58.2 |
| SFT-then-RL | Backbone=Qwen2.5-3B-In... | 2026.04 | 65 | - |
| Ceiling Model | | 2025.07 | 63.5 | 56.9 |
| Zero-RL | Backbone=Qwen2.5-3B-In... | 2026.04 | 63.5 | - |
| W2SG with Tree DPO | | 2025.07 | 61.1 | 55.4 |
| SFT Strong Model | Base Model=Llama-2-13b... | 2025.07 | 61.1 | 54.9 |
| SFT Strong Model + Best of N | | 2025.07 | 60.7 | 55.3 |
| SFT Strong Model | Base Model=Llama-2-13b... | 2025.07 | 59.2 | 53.6 |
| SFT Weak Model | Base Model=Llama-2-7b+SFT | 2025.07 | 55.5 | 41.2 |
| Only SFT | Backbone=Qwen2.5-3B-In... | 2026.04 | 43 | - |