| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Agent Capability Evaluation | SEAL 0 | Average Score (@8): 61.3 | 19 |
| Reasoning over conflicting evidence | SEAL-0 | Accuracy: 45.9 | 14 |
| Complex information-seeking | Seal-0 | Accuracy: 56.2 | 11 |
| Deep Search | SEAL 0 | Score: 41.44 | 11 |
| Complex Reasoning | Seal-0 | Accuracy (Seal-0): 53.4 | 8 |
| Agent Tool-use and Reasoning | SEAL (test) | Pass@3: 51.97 | 8 |
| Fact-seeking Question Answering | SEAL-0 | Accuracy: 10.8 | 4 |
| Agent | SEAL | Score: 57.4 | 2 |