Scientific Reasoning on ScienceWorld Unseen
[Chart: Average Reward over time. Top result: 62 (Co-Evolving Agents), Nov 27, 2025.]
Updated 4d ago
Evaluation Results
Method                 Setting                    Date     Average Reward
Co-Evolving Agents     Adaptation=Fine-tuning...  2025.11  62
Co-Evolving Agents     Backbone=Qwen3-4B-Inst...  2025.11  58.5
Llama-2-7B-Chat + ETO  Adaptation=Fine-tuning...  2025.11  55.5
ETO                    Backbone=Qwen3-4B-Inst...  2025.11  55.2
Llama-2-7B-Chat + RFT  Adaptation=Fine-tuning...  2025.11  54.3
Llama-2-7B-Chat + PPO  Adaptation=Fine-tuning...  2025.11  51.7
Llama-2-7B-Chat + SFT  Adaptation=Fine-tuning...  2025.11  41.9
SFT                    Backbone=Qwen3-4B-Inst...  2025.11  40.8
GPT-4                  Adaptation=In-context      2025.11  38.1
GPT-3.5-Turbo          Adaptation=In-context      2025.11  10.5