| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| Musique | R1-Searcher | Accuracy | 87 | 31 | 1mo ago |
| Knowledge-Intensive Reasoning Suite (2Wiki., Bamb., HQA, MuSi., SimQA) | | 2Wiki Score | 58.4 | 25 | 5d ago |
| HLE | R1-Searcher | Avg Score | 85 | 23 | 1mo ago |
| MuSiQue | Llama3.1-8B + ARPO | F1 Score | 34.8 | 18 | 1mo ago |
| Bamboogle | Llama3.1-8B + ARPO | F1 | 73.8 | 18 | 1mo ago |
| 2wikiMultiHopQA | Qwen2.5-7B + GRPO | F1 Score | 76.1 | 18 | 1mo ago |
| HotpotQA | Llama3.1-8B + ARPO | F1 Score | 0.654 | 18 | 1mo ago |
| WebWalker | Llama3.1-8B + ARPO | F1 Score | 30.5 | 18 | 1mo ago |
| 2WikiMultiHopQA | AutoTool (Qwen3-8B) | Accuracy | 48.8 | 18 | 1mo ago |
| HQA | AutoTraj | Average Score | 87 | 18 | 1mo ago |
| GPQA | CPPO | Result Score | 38.89 | 14 | 6d ago |
| 2Wiki | AutoTraj | Average Score | 0.89 | 9 | 1mo ago |
| MMLU-CF first 1,000 samples (test) | MGRS | Exact Match Accuracy | 74.2 | 7 | 1mo ago |
| Knowledge-intensive reasoning suite (HotpotQA, 2WikiMultihopQA, Musique) | TEPOdense | HotpotQA Score | 43.6 | 6 | 1mo ago |
| Generalization Verification | KDCM + Code Module | Hits@1 | 99.18 | 5 | 1mo ago |