Agent Interaction on ScienceWorld (val)
[Chart: Success Rate; highlighted result 44.08 by Gemma2-9B + MISE, Apr 13, 2026]
Evaluation Results

Method                  Backbone          Date     Success Rate
Gemma2-9B + MISE        Gemma2-9B-Ins...  2026.04  44.08
LLaMA3-8B + MISE        LLaMA3-8B-Ins...  2026.04  43.4
Gemma2-9B + PRM         Gemma2-9B-Ins...  2026.04  42.12
Gemma2-9B + PPO         Gemma2-9B-Ins...  2026.04  40.66
GPT-4o                  GPT-4o            2026.04  40.65
LLaMA3-8B + PRM         LLaMA3-8B-Ins...  2026.04  38.87
LLaMA3-8B + RFT         LLaMA3-8B-Ins...  2026.04  38.64
GPT-4o-mini             GPT-4o-mini       2026.04  38.6
LLaMA3-8B + ReAct       LLaMA3-8B-Ins...  2026.04  37.97
LLaMA3-8B + online DPO  LLaMA3-8B-Ins...  2026.04  36.76
LLaMA3-8B               LLaMA3-8B-Ins...  2026.04  36.12
Qwen2-7B + MISE         Qwen2-7B-Inst...  2026.04  28.87
LLaMA3-8B + PPO         LLaMA3-8B-Ins...  2026.04  26.04
Qwen2-7B + PRM          Qwen2-7B-Inst...  2026.04  22.6
Qwen2-7B + PPO          Qwen2-7B-Inst...  2026.04  22.09
Gemma2-9B               Gemma2-9B-Ins...  2026.04  13.49
Qwen2-7B                Qwen2-7B-Inst...  2026.04  12.35