
ScienceWorld

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Interactive Reasoning | ScienceWorld (Seen) | Success Rate: 63.91 | 31 |
| Interactive Science Reasoning | ScienceWorld (test) | Score: 84.6 | 27 |
| Interactive Decision-making | ScienceWorld Unseen (test) | Success Rate: 58.94 | 24 |
| Interactive Decision Making | ScienceWorld Unseen | Success Rate: 77.81 | 23 |
| Interactive Decision Making | ScienceWorld Seen | Success Rate: 81.57 | 23 |
| World Modeling | ScienceWorld | Matter Score: 52.8 | 20 |
| Agentic task completion | ScienceWorld | L0 Score: 75 | 18 |
| Agent Interaction | ScienceWorld (test) | Success Rate: 38.18 | 17 |
| Agent Interaction | ScienceWorld (val) | Success Rate: 44.08 | 17 |
| Agent Task | ScienceWorld | Success Rate: 67.5 | 11 |
| scientific reasoning | ScienceWorld Unseen | Average Reward: 58.5 | 10 |
| scientific reasoning | ScienceWorld Seen | Average Reward: 71.6 | 10 |
| Next-action prediction | ScienceWorld | Accuracy: 50.34 | 8 |
| Textual Environment Interaction | ScienceWorld | Base Score: 96.06 | 8 |
| Interactive Science Simulation | ScienceWorld v1.0 (test) | Task 1-1 (L) Score: 97.04 | 8 |
| Interactive Reasoning | ScienceWorld (Unseen) | Success Rate: 0.5862 | 7 |
| Scientific Reasoning in Text-based Environments | ScienceWorld (test) | Task 1-1 Score: 44.8 | 7 |
| Science simulation and text-based scientific reasoning | ScienceWorld variations (test) | Changes of State: Boiling Success: 4 | 7 |
| Text-based reasoning | ScienceWorld | Running Max Return: 88 | 6 |
| Interactive Reasoning | ScienceWorld 30 tasks | Score: 84.7 | 6 |
| Task 10-2 (L) | ScienceWorld | Mean Score (Task 10-2 (L)): 44.67 | 5 |
| Task 10-1 (L) | ScienceWorld | Mean Score: 100 | 5 |
| Genetics (Average) | ScienceWorld | Mean Score: 72.33 | 5 |
| Task 9-3 (L) | ScienceWorld | Mean Score: 56.67 | 5 |
| Task 9-2 (L) | ScienceWorld | Mean Score (Task 9-2 (L)): 60 | 5 |
Showing 25 of 79 rows