| Dataset Name | SOTA Method | Metric | Count | Updated | Trend |
|---|---|---|---|---|---|
| BrowseComp (test) | GPT-5-High | Accuracy: 54.9 | 19 | 28d ago | |
| BrowseComp-ZH (test) | | Accuracy: 68.7 | 17 | 28d ago | |
| MM Search | AXPO | Pass@4: 61 | 16 | 6d ago | |
| HR-MM Search | AXPO | Pass@4: 42 | 16 | 6d ago | |
| MM Search | | Pass@1: 46.1 | 16 | 6d ago | |
| HR-MM Search | SFT + AXPO | Pass@1: 25.9 | 16 | 6d ago | |
| Amazon | GEMS | Hit Rate @ 5: 83.99 | 15 | 3mo ago | |
| Humanity's Last Exam (HLE) (test) | | Accuracy: 45.8 | 14 | 28d ago | |
| BrowseComp | | Score: 67.6 | 11 | 2mo ago | |
| BrowseComp-ZH | | Score: 81.3 | 10 | 2mo ago | |
| xbench (test) | OpenSeeker-v2-30B-SFT | Accuracy: 78 | 9 | 28d ago | |
| XBench | OpenSeeker-v1-Data-11.7k | Score: 74 | 9 | 2mo ago | |
| HLE text | | Score: 45.8 | 7 | 3mo ago | |
| WebWalker | Qwen3-235B | Score: 59.5 | 7 | 3mo ago | |
| Frames | Qwen3-235B | Score: 70.5 | 7 | 3mo ago | |
| Multi-agent Simulation averaged across 6 impairment dimensions | | AURC (% of max): 100 | 5 | 2mo ago | |
| Multi-agent Communication Environment (test) | | Mean Normalized Performance Drop: 0 | 5 | 2mo ago | |
| Large environment | COMRES-VLM | Average Completion Time (timesteps): 162.24 | 3 | 3mo ago | |
| Medium environment | COMRES-VLM | Average Completion Time (timesteps): 104.25 | 3 | 3mo ago | |
| Small environment | COMRES-VLM | Average Completion Time (timesteps): 63.21 | 3 | 3mo ago | |
| MAT-Search | Qwen2.5-VL-3B | F1 Score: 27.1 | 2 | 3mo ago | |