| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Interactive Reasoning | ScienceWorld (Seen) | Success Rate | 63.91 | 31 |
| Interactive Science Reasoning | ScienceWorld (test) | Score | 84.6 | 27 |
| Interactive Decision-making | ScienceWorld Unseen (test) | Success Rate | 58.94 | 24 |
| Interactive Decision Making | ScienceWorld Unseen | Success Rate | 77.81 | 23 |
| Interactive Decision Making | ScienceWorld Seen | Success Rate | 81.57 | 23 |
| World Modeling | ScienceWorld | Matter Score | 52.8 | 20 |
| Agentic task completion | ScienceWorld | L0 Score | 75 | 18 |
| Agent Interaction | ScienceWorld (test) | Success Rate | 38.18 | 17 |
| Agent Interaction | ScienceWorld (val) | Success Rate | 44.08 | 17 |
| Agent Task | ScienceWorld | Success Rate | 67.5 | 11 |
| scientific reasoning | ScienceWorld Unseen | Average Reward | 58.5 | 10 |
| scientific reasoning | ScienceWorld Seen | Average Reward | 71.6 | 10 |
| Next-action prediction | ScienceWorld | Accuracy | 50.34 | 8 |
| Textual Environment Interaction | ScienceWorld | Base Score | 96.06 | 8 |
| Interactive Science Simulation | ScienceWorld v1.0 (test) | Task 1-1 (L) Score | 97.04 | 8 |
| Interactive Reasoning | ScienceWorld (Unseen) | Success Rate | 0.5862 | 7 |
| Scientific Reasoning in Text-based Environments | ScienceWorld (test) | Task 1-1 Score | 44.8 | 7 |
| Science simulation and text-based scientific reasoning | ScienceWorld variations (test) | Changes of State: Boiling Success | 4 | 7 |
| Text-based reasoning | ScienceWorld | Running Max Return | 88 | 6 |
| Interactive Reasoning | ScienceWorld 30 tasks | Score | 84.7 | 6 |
| Task 10-2 (L) | ScienceWorld | Mean Score (Task 10-2 (L)) | 44.67 | 5 |
| Task 10-1 (L) | ScienceWorld | Mean Score | 100 | 5 |
| Genetics (Average) | ScienceWorld | Mean Score | 72.33 | 5 |
| Task 9-3 (L) | ScienceWorld | Mean Score | 56.67 | 5 |
| Task 9-2 (L) | ScienceWorld | Mean Score (Task 9-2 (L)) | 60 | 5 |