| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Deep research agents / Multi-step reasoning | BrowseComp-Plus OOD | Success Rate (SR): 54.6 | 24 |
| Long-context reasoning | BrowseComp+ 1K documents | Accuracy: 94.6 | 16 |
| Web Browsing and Tool Use | BrowseComp+ original (test) | Performance (%): 38.72 | 15 |
| Web Browsing Reasoning | BrowseComp+ | Avg@8 Accuracy: 11 | 7 |
| Scaling Model Validation | BrowseComp-Plus Out-of-sample (val) | MAE: 0.071 | 1 |