Scientific Reasoning on ScienceWorld Seen
Top result: 71.6 Average Reward (Llama-2-7B-Chat + RFT)
[Chart: Average Reward; date shown: Nov 27, 2025]
Evaluation Results
Method                   Configuration               Links     Average Reward
Llama-2-7B-Chat + RFT    Adaptation=Fine-tuning...   2025.11   71.6
Co-Evolving Agents       Adaptation=Fine-tuning...   2025.11   69.7
Llama-2-7B-Chat + ETO    Adaptation=Fine-tuning...   2025.11   65.6
Co-Evolving Agents       Backbone=Qwen3-4B-Inst...   2025.11   65.1
Llama-2-7B-Chat + PPO    Adaptation=Fine-tuning...   2025.11   59.4
ETO                      Backbone=Qwen3-4B-Inst...   2025.11   58.6
Llama-2-7B-Chat + SFT    Adaptation=Fine-tuning...   2025.11   47.3
SFT                      Backbone=Qwen3-4B-Inst...   2025.11   43.6
GPT-4                    Adaptation=In-context       2025.11   42.9
GPT-3.5-Turbo            Adaptation=In-context       2025.11   7.9