SkillsBench

Benchmarks

Task Name	Dataset Name	SOTA Result
Downstream skill execution	SkillsBench (test)	Reward 48.9	30
Skill execution	SkillsBench	Overall Success Rate (avg@5) 56.9	26
Agent task execution	SkillsBench 11 domains	Overall Score 44.8	16
Skill-Augmented Agent Performance Evaluation	SkillsBench 86 runnable tasks (frozen-policy transfer)	Valid Task Count 79	15
Downstream task performance	SkillsBench	Pass Rate 51.1	12
Prefix-risk ranking	SkillsBench (held-out)	AUPRC 53.3	11
Skill prediction	SkillsBench Real-task holdout n=65	Set F1 74.2	10
Skill prediction	SkillsBench Synthetic n=494 (test)	Set F1 73.9	10
Agent Task Completion	SkillsBench	Pass Rate 22.6	9
Skill Selection	SkillsBench 1000-skill scale	Reward (%) 36.8	8
Agent task execution	SkillsBench 1.0 (test)	Pass Rate (With Skills) 71.1	8
Poisoning Injection Attack	SkillsBench n=81	Verification Rate 25.9	7
Skill-based Task Execution	SkillsBench	Accuracy 68.4	6
Downstream task execution	SkillsBench	Reward Mean (%) 29.26	6
Skill-Augmented Agent Performance Evaluation	SkillsBench 20-task balanced panel (held-out evaluation)	Valid Task Count 19	5
Skill-augmented Task Completion	SkillsBench 77 tasks (test)	Mean Reward 0.46	4
Task completion pass rate	SkillsBench 77 tasks	Coverage 72.7	4
Video Index	SkillsBench	Solve Tokens 105,000	4
PPTX	SkillsBench	Tokens Used (Total) 39,000	4
Offer Letter	SkillsBench	Tokens Used 41,000	4
Mars Clouds	SkillsBench	Solve Tokens (K) 52	4
JAX	SkillsBench	Solve Tokens 30,000	4
Citation	SkillsBench	Solve Tokens Count 66	4
3D Scan	SkillsBench	Solve Tokens (K) 15	4
Skill Generation	SkillsBench	Baseline Success Rate 22	4

Showing 25 of 32 rows