| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Scientific Reasoning | Science Domain In-Domain: SampleQA, GPQA (ALL), HLE | SampleQA Score: 3.26 | 18 |
| Reasoning | Science Domain 20 tasks (test) | Total Cost (USD): 0.11 | 3 |
| Multi-agent task routing | Science Domain 1.0 (test) | Total Cost: 0.59 | 2 |