| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Downstream skill execution | SkillsBench (test) | Reward | 48.9 | 30 |
| Skill execution | SkillsBench | Overall Success Rate (avg@5) | 56.9 | 26 |
| Prefix-risk ranking | SkillsBench (held-out) | AUPRC | 53.3 | 11 |
| Agent Task Completion | SkillsBench | Pass Rate | 22.6 | 9 |
| Agent task execution | SkillsBench 1.0 (test) | Pass Rate (With Skills) | 71.1 | 8 |
| Skill-based Task Execution | SkillsBench | Accuracy | 68.4 | 6 |
| Downstream task execution | SkillsBench | Reward Mean (%) | 29.26 | 6 |
| Video Index | SkillsBench | Solve Tokens | 105,000 | 4 |
| PPTX | SkillsBench | Tokens Used (Total) | 39,000 | 4 |
| Offer Letter | SkillsBench | Tokens Used | 41,000 | 4 |
| Mars Clouds | SkillsBench | Solve Tokens (K) | 52 | 4 |
| JAX | SkillsBench | Solve Tokens | 30,000 | 4 |
| Citation | SkillsBench | Solve Tokens Count | 66 | 4 |
| 3D Scan | SkillsBench | Solve Tokens (K) | 15 | 4 |
| Skill Generation | SkillsBench | Baseline Success Rate | 22 | 4 |
| Skill-assisted task execution | SkillsBench 1.0 (test) | Pass@1 | 19.5 | 4 |
| Skill Retrieval | SkillsBench | Mean Skills Retrieved per Task | 2.8 | 4 |
| Agentic Skill Execution | SkillsBench Gemini CLI | Pass Count | 4 | 2 |
| Agentic Skill Execution | SkillsBench Codex CLI | Pass Count | 11 | 2 |
| Agentic Skill Execution | SkillsBench Kimi CLI | Success Count | 36 | 2 |
| Agentic Skill Execution | SkillsBench Claude Code CLI | Pass Count | 9 | 2 |