Benign completion reliability on Agent Security Bench Benign
Top result: GPT-5, Completion Reliability 99
[Chart: Completion Reliability by date; data point at Mar 3, 2026]
Evaluation Results

| Method | Configuration | Date | Completion Reliability |
|---|---|---|---|
| GPT-5 | Safety Scaffolding=MOS... | 2026.03 | 99 |
| GPT-5 | Safety Scaffolding=Bas... | 2026.03 | 98 |
| GPT-4o | Safety Scaffolding=MOS... | 2026.03 | 93 |
| Phi-4 | Safety Scaffolding=MOS... | 2026.03 | 91 |
| Qwen2.5-7B | Safety Scaffolding=Bas... | 2026.03 | 90 |
| GPT-4o | Safety Scaffolding=Bas... | 2026.03 | 89 |
| Qwen3-4B-Think | Safety Scaffolding=MOS... | 2026.03 | 85 |
| Qwen2.5-7B | Safety Scaffolding=MOS... | 2026.03 | 84 |
| Phi-4 | Safety Scaffolding=Bas... | 2026.03 | 78 |
| Qwen3-4B-Think | Safety Scaffolding=Bas... | 2026.03 | 44 |