Interactive Scientific Exploration on ScienceWorld standard 30-task protocol
[Chart: Average Score (Short); leading entry SDP at 73.72, dated May 12, 2026. Selectable metrics: Average Score (Short), Average Score (Medium), Average Score (Long), Average Score (Overall).]
Updated 20d ago
Evaluation Results
| Method | Details | Date | Average Score (Short) | Average Score (Medium) | Average Score (Long) | Average Score (Overall) |
|---|---|---|---|---|---|---|
| SDP | LLM=GPT-4, Training st... | 2026.05 | 73.72 | 53.5 | 50.41 | 59.16 |
| Reflexion | LLM=GPT-4, Training st... | 2026.05 | 71.47 | 35.43 | 30.17 | 45.34 |
| Plan-and-Act | LLM=GPT-4, Training st... | 2026.05 | 60.52 | 46.43 | 34.77 | 47.86 |
| CoT | LLM=GPT-4, Training st... | 2026.05 | 49.54 | 47.87 | 23.09 | 39.23 |
| ReAct | LLM=GPT-4, Training st... | 2026.05 | 48.79 | 44.01 | 21.07 | 36.43 |
| EVOAGENT | LLM=GPT-4, Training st... | 2026.05 | 48.67 | 36.17 | 11.38 | 30.42 |
| SayCan | LLM=GPT-4, Training st... | 2026.05 | 43.83 | 36.58 | 23.65 | 33.82 |