ScienceWorld

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Interactive Decision Making | ScienceWorld Seen | Success Rate 88.8 | 72 |
| Interactive Decision Making | ScienceWorld | Success Rate 54.2 | 42 |
| Interactive Decision Making | ScienceWorld Unseen | Success Rate 85.15 | 32 |
| Interactive Reasoning | ScienceWorld (Seen) | Success Rate 63.91 | 31 |
| Mean Reward | ScienceWorld | Mean Reward 0.319 | 30 |
| Science Simulation Task Completion | ScienceWorld Unseen | Success Rate 66.3 | 28 |
| Science Simulation Task Completion | ScienceWorld Seen | Success Rate 69.8 | 28 |
| Multi-turn Agentic Task | ScienceWorld | Success Rate 62 | 28 |
| Interactive Science Reasoning | ScienceWorld (test) | Score 84.6 | 27 |
| Science Experiment Execution | ScienceWorld (test) | Success Rate 51.51 | 24 |
| Interactive Decision-making | ScienceWorld Unseen (test) | Success Rate 58.94 | 24 |
| Interactive Environment Task Completion | ScienceWorld (Unseen) | Average Reward 90.1 | 22 |
| Interactive Environment Task Completion | ScienceWorld (Seen) | Average Reward 89.5 | 22 |
| Agentic Reasoning | ScienceWorld | Original Score 82.2 | 20 |
| Interactive Decision Making | ScienceWorld Seen (val) | Average Reward 0.7349 | 20 |
| World Modeling | ScienceWorld | Matter Score 52.8 | 20 |
| Embodied Agent Task | ScienceWorld Unseen | Success Rate 70.8 | 18 |
| Embodied Agent Task | ScienceWorld Seen | Success Rate 70.9 | 18 |
| Text-based Task Completion | ScienceWorld | Mean Normalised Score 32.43 | 18 |
| Agentic task completion | ScienceWorld | L0 Score 75 | 18 |
| Agent Interaction | ScienceWorld (test) | Success Rate 38.18 | 17 |
| Agent Interaction | ScienceWorld (val) | Success Rate 44.08 | 17 |
| scientific reasoning | ScienceWorld | Overall Score 83.7 | 16 |
| Interactive Agent Task | ScienceWorld | Efficiency Factor 11.5 | 15 |
| Interactive Decision Making | ScienceWorld (OOD) | Score 9.9 | 14 |
Showing 25 of 108 rows