Scientific Reasoning on ScienceWorld
Chart: Seen Accuracy and Unseen Accuracy by date (Nov 27, 2025). Top Seen Accuracy: 74.5 (Co-Evolving Agents).
Updated 1mo ago
Evaluation Results
Method              Model             Date     Seen Accuracy  Unseen Accuracy
Co-Evolving Agents  Llama-2-13B-chat  2025.11  74.5           65.5
ETO                 Llama-2-13B-chat  2025.11  72.6           65.3
SFT                 Llama-2-13B-chat  2025.11  64.5           56.1