Quality Evaluation on Security reasoning tasks
[Chart: Functional Efficacy Score. Highlighted result: Force Weak, 35.6 (Jan 27, 2026). Available metrics: Functional Efficacy Score, Safety & Ethical Compliance Score, Robustness & Automation Score, Quality & Cleanliness Score.]
Updated 3mo ago
Evaluation Results
| Method | Configuration | Date | Functional Efficacy Score | Safety & Ethical Compliance Score | Robustness & Automation Score | Quality & Cleanliness Score |
|---|---|---|---|---|---|---|
| Force Weak | Routing Strategy=Force... | 2026.01 | 35.6 | 27.1 | 18.1 | 9 |
| CASTER | Routing Strategy=CASTER | 2026.01 | 35.5 | 27.6 | 18.1 | 9.1 |
| Force Strong | Routing Strategy=Force... | 2026.01 | 35.3 | 26.8 | 17.9 | 8.9 |