SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills
About
OpenClaw's ClawHub marketplace hosts tens of thousands of community-contributed agent skills (49,592 in our 2026-04-04 snapshot), and recent audits report that 13-26% contain security vulnerabilities. Regex scanners miss obfuscated payloads; formal static analyzers cannot read the natural-language SKILL.md instructions that hide prompt injection and social engineering. Neither approach covers both modalities. SkillSieve is a three-layer detection framework that applies deeper analysis only where needed. Layer 1 runs regex, AST, and metadata checks through a recall-tuned heuristic scorer, filtering 86% of the volume. Layer 2 routes suspicious skills to an LLM, splitting the analysis into four parallel sub-tasks with structured outputs. Layer 3 puts high-risk skills before a jury of three LLMs that vote independently and debate when they disagree. We evaluate on 49,592 real ClawHub skills and adversarial samples across five evasion techniques, running the pipeline on a 440 USD ARM single-board computer. On a 390-skill labeled benchmark, SkillSieve achieves F1 = 0.920 (precision 0.912, recall 0.929) at 0.006 USD per skill. An optional XGBoost fast-path cuts 32% of Layer-2/3 LLM calls with a 1.6-point F1 reduction, while preserving full-pipeline recall (0.929). For cross-ecosystem generalization, we adapt the framework to Feishu/Lark and scan 52 real packages, where Layer 2 corrects Layer 1 false positives from domain-specific idioms, suggesting a low-cost adaptation path to similar enterprise platforms. We deploy SkillSieve as a Feishu chat bot for real-time skill vetting. Code, data, and benchmark are open-sourced.
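The escalation logic described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `triage`, `Verdict`, and `majority`, both thresholds, and the rule that a low Layer-2 score clears a skill as benign are hypothetical assumptions made for illustration. The `llm_score` callable stands in for the four-sub-task Layer-2 analysis, whose sub-tasks the abstract does not enumerate, and `debate` stands in for the jury's discussion step.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

# Hypothetical thresholds, chosen only for illustration (not from the paper).
L1_THRESHOLD = 0.2  # deliberately low so the cheap layer favors recall
L2_THRESHOLD = 0.6  # assumed bar for escalating a skill to the jury


@dataclass
class Verdict:
    label: str  # "benign" or "malicious"
    layer: int  # layer that produced the final decision


def majority(votes: Sequence[bool]) -> bool:
    """Return True when more than half of the jurors vote 'malicious'."""
    return sum(votes) * 2 > len(votes)


def triage(
    skill: str,
    heuristic_score: Callable[[str], float],
    llm_score: Callable[[str], float],
    jurors: Sequence[Callable[[str], bool]],
    debate: Optional[Callable[[str, List[bool]], List[bool]]] = None,
) -> Verdict:
    # Layer 1: cheap regex/AST/metadata heuristics clear most skills.
    if heuristic_score(skill) < L1_THRESHOLD:
        return Verdict("benign", 1)

    # Layer 2: a single LLM pass (stand-in for the four parallel sub-tasks).
    if llm_score(skill) < L2_THRESHOLD:
        return Verdict("benign", 2)

    # Layer 3: independent jury votes; a debate round runs only on disagreement.
    votes = [juror(skill) for juror in jurors]
    if debate is not None and len(set(votes)) > 1:
        votes = debate(skill, votes)
    return Verdict("malicious" if majority(votes) else "benign", 3)


if __name__ == "__main__":
    # Stub scorers and jurors, purely for demonstration.
    jury = [lambda s: True, lambda s: True, lambda s: False]
    print(triage("example skill", lambda s: 0.5, lambda s: 0.9, jury))
```

The point of this shape is cost control: each layer is more expensive than the one before it, so the expensive calls run only for skills that earlier layers could not clear.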
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Malicious Skill Detection | ClawHub Overall 1.0 | Overall Balance | 84 | 9 |
| Malicious Skill Detection | ClawHub Command Injection 1.0 (n=27) | Catch Rate | 85 | 9 |
| Malicious Skill Detection | ClawHub Prompt Injection 1.0 (n=19) | Catch Rate | 79 | 9 |
| Malicious Skill Detection | ClawHub | Overall Detection Rate | 84 | 9 |
| Malicious Skill Detection | ClawHub Unsafe File Ops 1.0 (n=10) | Catch Rate | 80 | 9 |
| Vulnerability Detection | SkillVetBench Command Injection | Malicious Verdict Count | 0 | 9 |
| Vulnerability Detection | SkillVetBench Prompt Injection | Malicious Verdict Count | 0 | 9 |
| Vulnerability Detection | SkillVetBench Unsafe File Ops | Malicious Verdict Count | 0 | 9 |
| Vulnerability Detection | SkillVetBench Data Exposure | Malicious Verdict Count | 0 | 9 |
| Vulnerability Detection | SkillVetBench Supply Chain | Malicious Verdict Count | 0 | 9 |