| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Scientific Agent Task Completion | ScienceAgentBench | Success Rate (SR): 43.1 | 40 |
| Defect Detection | ScienceAgentBench (12 confirmed defects, 102 tasks) | Recall Average (RecA): 100 | 12 |
| Scientific Code Generation | ScienceAgentBench | SR: 25.5 | 10 |
| Scientific Code Generation | ScienceAgentBench (test) | SR: 27.5 | 8 |
| Scientific Agent Task | ScienceAgentBench (test) | Success Rate (SR): 18.6 | 6 |