Malicious Skill Detection on MaliciousAgentSkillsBench (404 malicious, 502 benign)
[Chart: Recall, Precision, and F1 Score by method. Top entry shown: BIV, Recall 100 (May 12, 2026).]
Updated 21d ago
Evaluation Results
Method      Model              Date     Recall  Precision  F1 Score
BIV         Claude Sonnet 4.5  2026.05  100     99         99
BIV         Claude Sonnet 4.5  2026.05  98      92         95
LLM-only    -                  2026.05  93      99         96
LLM-only    -                  2026.05  89      97         93
BIV         Claude Sonnet 4.5  2026.05  86      57         69
LLM-only    -                  2026.05  64      74         68
Rule-based  -                  2026.05  45      88         60
Rule-based  -                  2026.05  39      20         26
Rule-based  -                  2026.05  35      62         44
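
The F1 Score values in the results above appear consistent with the standard definition of F1 as the harmonic mean of precision and recall. The following is a minimal sketch under that assumption; the helper name `f1_score` and the choice of rows are illustrative and do not come from the benchmark's own code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale, e.g. 0-100)."""
    if precision + recall == 0:
        # Avoid division by zero when a method finds nothing and flags nothing correctly.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (method, recall, precision) triples copied from the results above.
rows = [
    ("BIV", 100, 99),
    ("LLM-only", 93, 99),
    ("Rule-based", 45, 88),
]

for method, recall, precision in rows:
    print(f"{method}: F1 = {f1_score(precision, recall):.1f}")
```

Because the listed recall and precision are already rounded, a recomputed F1 can differ from the listed one by about a point. For example, recall 64 and precision 74 give roughly 68.6, while the listing shows 68, presumably because the published figure was computed from unrounded values.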