| Dataset Name | SOTA Method | Metric | Value | | Last Updated |
|---|---|---|---|---|---|
| BigCodeBench | Qwen3-32B | Recall@1 | 28.2 | 11 | 1mo ago |
| CHAMP | | Recall@1 | 22.5 | 11 | 1mo ago |
| LogicBench | Qwen3-32B | Recall@1 | 31.4 | 11 | 1mo ago |
| BigCodeBench | Qwen3-32B | nDCG@1 | 73.2 | 11 | 1mo ago |
| MedCalc-Bench | Qwen3-235B | nDCG@1 | 92.3 | 11 | 1mo ago |
| CHAMP | Qwen3-32B | nDCG@1 | 35.4 | 11 | 1mo ago |
| ToolQA | Qwen3-235B | nDCG@1 | 56.4 | 11 | 1mo ago |
| LogicBench | Qwen3-32B | nDCG@1 | 31.4 | 11 | 1mo ago |
| TheoremQA | Qwen3-32B | nDCG@1 | 77.4 | 11 | 1mo ago |
| ToolBench | ShardMemo | Precision@R | 97 | 7 | 3mo ago |
| SkillsBench | SkillFlow | Mean Skills Retrieved per Task | 2.8 | 4 | 2mo ago |
| Terminal-Bench | SkillFlow | Mean Skills Retrieved per Task | 1.5 | 3 | 2mo ago |