SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills

About

OpenClaw's ClawHub marketplace hosts tens of thousands of community-contributed agent skills (49,592 in our 2026-04-04 snapshot), and recent audits report that 13-26% contain security vulnerabilities. Regex scanners miss obfuscated payloads; formal static analyzers cannot read the natural-language SKILL.md instructions that hide prompt injection and social engineering. Neither approach covers both modalities. SkillSieve is a three-layer detection framework that applies deeper analysis only where needed. Layer 1 runs regex, AST, and metadata checks through a recall-tuned heuristic scorer, filtering 86% of the volume. Layer 2 routes suspicious skills to an LLM, splitting the analysis into four parallel sub-tasks with structured outputs. Layer 3 puts high-risk skills before a jury of three LLMs that vote independently and debate when they disagree. We evaluate on 49,592 real ClawHub skills and adversarial samples across five evasion techniques, running the pipeline on a 440 USD ARM single-board computer. On a 390-skill labeled benchmark, SkillSieve achieves F1 = 0.920 (precision 0.912, recall 0.929) at 0.006 USD per skill. An optional XGBoost fast-path cuts 32% of Layer-2/3 LLM calls with a 1.6-point F1 reduction, while preserving full-pipeline recall (0.929). For cross-ecosystem generalization, we adapt the framework to Feishu/Lark and scan 52 real packages, where Layer 2 corrects Layer 1 false positives from domain-specific idioms, suggesting a low-cost adaptation path to similar enterprise platforms. We deploy SkillSieve as a Feishu chat bot for real-time skill vetting. Code, data, and benchmark are open-sourced.
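The escalation logic described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Skill` type, the regex list, the threshold, and the `layer2` and `jury` callables are all hypothetical stand-ins. The real Layer 1 also uses AST and metadata checks, Layer 2 splits its analysis into four parallel sub-tasks, and the Layer 3 jurors debate when they disagree; the sketch replaces those with a plain regex score, a single callable, and a simple majority vote.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Skill:
    """Hypothetical container for one skill package."""
    name: str
    skill_md: str  # natural-language SKILL.md instructions
    code: str      # bundled scripts


# Illustrative patterns only; the paper's actual rule set is not given here.
SUSPICIOUS_PATTERNS = (
    r"curl\s+[^|]*\|\s*(ba)?sh",             # pipe a download into a shell
    r"base64\s+(-d|--decode)",               # decode an embedded payload
    r"ignore (all )?previous instructions",  # common prompt-injection phrasing
)

# A judge maps a skill to the label "malicious" or "benign".
Judge = Callable[[Skill], str]


def layer1_score(skill: Skill) -> float:
    """Cheap heuristic: fraction of suspicious patterns that match."""
    text = skill.skill_md + "\n" + skill.code
    hits = sum(
        bool(re.search(pattern, text, re.IGNORECASE))
        for pattern in SUSPICIOUS_PATTERNS
    )
    return hits / len(SUSPICIOUS_PATTERNS)


def triage(
    skill: Skill,
    layer2: Judge,
    jury: Sequence[Judge],
    l1_threshold: float = 0.0,  # assumed value; low, to favor recall
) -> str:
    """Escalate a skill through the layers only as far as needed."""
    # Layer 1: skills with no heuristic signal never reach an LLM.
    if layer1_score(skill) <= l1_threshold:
        return "benign"
    # Layer 2: a single LLM pass (stubbed as a callable) reviews suspects.
    if layer2(skill) == "benign":
        return "benign"
    # Layer 3: independent jurors vote; the majority label wins.
    votes = Counter(judge(skill) for judge in jury)
    return votes.most_common(1)[0][0]
```

The point of the cascade is cost: the expensive judges run only on the skills that the cheaper layers fail to clear, which is how the abstract's Layer 1 can filter most of the volume before any LLM call is made.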

Yinghan Hou, Zongyou Yang, Zaihu Pang, Xiujun Ma • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Malicious Skill Detection | ClawHub Overall 1.0 | Overall Balance: 84 | 9 |
| Malicious Skill Detection | ClawHub Command Injection 1.0 (n=27) | Catch Rate: 85 | 9 |
| Malicious Skill Detection | ClawHub Prompt Injection 1.0 (n=19) | Catch Rate: 79 | 9 |
| Malicious Skill Detection | ClawHub | Overall Detection Rate: 84 | 9 |
| Malicious Skill Detection | ClawHub Unsafe File Ops 1.0 (n=10) | Catch Rate: 80 | 9 |
| Vulnerability Detection | SkillVetBench Command Injection | Malicious Verdict Count: 0.00e+0 | 9 |
| Vulnerability Detection | SkillVetBench Prompt Injection | Malicious Verdict Count: 0.00e+0 | 9 |
| Vulnerability Detection | SkillVetBench Unsafe File Ops | Malicious Verdict Count: 0.00e+0 | 9 |
| Vulnerability Detection | SkillVetBench Data Exposure | Malicious Verdict Count: 0.00e+0 | 9 |
| Vulnerability Detection | SkillVetBench Supply Chain | Malicious Verdict Count: 0.00e+0 | 9 |
Showing 10 of 13 rows
