| Dataset Name | SOTA Method | Metric | Score | Trend | Updated |
|---|---|---|---|---|---|
| Musique | R1-Searcher | Accuracy | 87 | 31 | 4d ago |
| HLE | R1-Searcher | Avg Score | 85 | 23 | 4d ago |
| MuSiQue | Llama3.1-8B + ARPO | F1 Score | 34.8 | 18 | 2d ago |
| Bamboogle | Llama3.1-8B + ARPO | F1 | 73.8 | 18 | 2d ago |
| 2wikiMultiHopQA | Qwen2.5-7B + GRPO | F1 Score | 76.1 | 18 | 2d ago |
| HotpotQA | Llama3.1-8B + ARPO | F1 Score | 0.654 | 18 | 2d ago |
| WebWalker | Llama3.1-8B + ARPO | F1 Score | 30.5 | 18 | 2d ago |
| 2WikiMultiHopQA | AutoTool (Qwen3-8B) | Accuracy | 48.8 | 18 | 4d ago |
| HQA | AutoTraj | Average Score | 87 | 18 | 4d ago |
| 2Wiki | AutoTraj | Average Score | 0.89 | 9 | 4d ago |
| MMLU-CF first 1,000 samples (test) | MGRS | Exact Match Accuracy | 74.2 | 7 | 4d ago |
| Knowledge-intensive reasoning suite (HotpotQA, 2WikiMultihopQA, Musique) | TEPOdense | HotpotQA Score | 43.6 | 6 | 4d ago |
| Generalization Verification | KDCM + Code Module | Hits@1 | 99.18 | 5 | 4d ago |