Interactive Agent Task on ScienceWorld Unseen
[Chart: Average Reward over time. Best result: BPO, 75.67 Average Reward (Aug 5, 2025)]
Updated 15d ago
Evaluation Results
| Method | Attributes | Date | Average Reward |
|---|---|---|---|
| BPO | Approach=System-2, Eva... | 2025.08 | 75.67 |
| Deepseek-R1 | Approach=System-2, Eva... | 2025.08 | 61.13 |
| ETO | Approach=System-2, Eva... | 2025.08 | 58.16 |
| o3-mini | Approach=System-2, Eva... | 2025.08 | 54.55 |
| MPO | Approach=System-2, Eva... | 2025.08 | 53.24 |
| SFT | Approach=System-2, Eva... | 2025.08 | 51.97 |
| Qwen-3-Thinking | Approach=System-2, Eva... | 2025.08 | 46.99 |
| Llama-3.1-8B-Instruct | Approach=System-1, Eva... | 2025.08 | 28.18 |
| Qwen-2.5-7B-Instruct | Approach=System-1, Eva... | 2025.08 | 27.32 |