| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Web Search | BrowseComp-Plus | Pass@3 | 53.02 | 60 |
| Deep research agents / Multi-step reasoning | BrowseComp-Plus OOD | Success Rate (SR) | 54.6 | 24 |
| Complex Tasks | BrowseComp+ Complex Tasks 2nd Pass | Accuracy | 89 | 16 |
| Long-context reasoning | BrowseComp+ 1K documents | Accuracy | 94.6 | 16 |
| Web Browsing and Tool Use | BrowseComp+ original (test) | Performance (%) | 38.72 | 15 |
| Complex Task Solving | BrowseComp+ Naive Stream | Accuracy (1st-Q) | 55 | 8 |
| Complex Task Solving | BrowseComp+ Compositional Stream | Accuracy (1st-Q) | 90 | 8 |
| Subtasks | BrowseComp+ Subtasks 1st Pass | Accuracy | 97.7 | 8 |
| Web Browsing Reasoning | BrowseComp+ | Avg@8 Accuracy | 11 | 7 |
| Scaling Model Validation | BrowseComp-Plus Out-of-sample (val) | MAE | 0.071 | 1 |