SkillsBench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Downstream skill execution | SkillsBench (test) | Reward: 48.9 | 30 |
| Skill execution | SkillsBench | Overall Success Rate (avg@5): 56.9 | 26 |
| Prefix-risk ranking | SkillsBench (held-out) | AUPRC: 53.3 | 11 |
| Agent Task Completion | SkillsBench | Pass Rate: 22.6 | 9 |
| Agent task execution | SkillsBench 1.0 (test) | Pass Rate (With Skills): 71.1 | 8 |
| Skill-based Task Execution | SkillsBench | Accuracy: 68.4 | 6 |
| Downstream task execution | SkillsBench | Reward Mean (%): 29.26 | 6 |
| Video Index | SkillsBench | Solve Tokens: 105,000 | 4 |
| PPTX | SkillsBench | Tokens Used (Total): 39,000 | 4 |
| Offer Letter | SkillsBench | Tokens Used: 41,000 | 4 |
| Mars Clouds | SkillsBench | Solve Tokens (K): 52 | 4 |
| JAX | SkillsBench | Solve Tokens: 30,000 | 4 |
| Citation | SkillsBench | Solve Tokens Count: 66 | 4 |
| 3D Scan | SkillsBench | Solve Tokens (K): 15 | 4 |
| Skill Generation | SkillsBench | Baseline Success Rate: 22 | 4 |
| Skill-assisted task execution | SkillsBench 1.0 (test) | Pass@1: 19.5 | 4 |
| Skill Retrieval | SkillsBench | Mean Skills Retrieved per Task: 2.8 | 4 |
| Agentic Skill Execution | SkillsBench Gemini CLI | Pass Count: 4 | 2 |
| Agentic Skill Execution | SkillsBench Codex CLI | Pass Count: 11 | 2 |
| Agentic Skill Execution | SkillsBench Kimi CLI | Success Count: 36 | 2 |
| Agentic Skill Execution | SkillsBench Claude Code CLI | Pass Count: 9 | 2 |