Agent Interaction on ScienceWorld (test)
[Chart: Success Rate over time. Top result: LLaMA3-8B + MISE, 38.18 (Apr 13, 2026).]
Evaluation Results
| Method | Backbone | Date | Success Rate |
|---|---|---|---|
| LLaMA3-8B + MISE | LLaMA3-8B-Ins... | 2026.04 | 38.18 |
| Gemma2-9B + MISE | Gemma2-9B-Ins... | 2026.04 | 36.8 |
| LLaMA3-8B + RFT | LLaMA3-8B-Ins... | 2026.04 | 34.02 |
| Gemma2-9B + PRM | Gemma2-9B-Ins... | 2026.04 | 33.85 |
| LLaMA3-8B + PRM | LLaMA3-8B-Ins... | 2026.04 | 33.65 |
| LLaMA3-8B + ReAct | LLaMA3-8B-Ins... | 2026.04 | 33.1 |
| GPT-4o | GPT-4o | 2026.04 | 32.41 |
| Gemma2-9B + PPO | Gemma2-9B-Ins... | 2026.04 | 32.19 |
| LLaMA3-8B + online DPO | LLaMA3-8B-Ins... | 2026.04 | 32.14 |
| GPT-4o-mini | GPT-4o-mini | 2026.04 | 31.98 |
| LLaMA3-8B | LLaMA3-8B-Ins... | 2026.04 | 31.79 |
| Qwen2-7B + MISE | Qwen2-7B-Inst... | 2026.04 | 28.61 |
| Qwen2-7B + PRM | Qwen2-7B-Inst... | 2026.04 | 22.24 |
| Qwen2-7B + PPO | Qwen2-7B-Inst... | 2026.04 | 21.85 |
| LLaMA3-8B + PPO | LLaMA3-8B-Ins... | 2026.04 | 20.03 |
| Gemma2-9B | Gemma2-9B-Ins... | 2026.04 | 11.91 |
| Qwen2-7B | Qwen2-7B-Inst... | 2026.04 | 10.67 |