
Benign completion reliability on Agent Security Bench Benign

Completion Reliability: 99

GPT-5

Mar 3, 2026
Updated 1mo ago

Evaluation Results

Date      Completion Reliability
2026.03   99
2026.03   98
2026.03   93
2026.03   91
2026.03   90
2026.03   89
2026.03   85
2026.03   84
2026.03   78
2026.03   44