Sub-task Completion on AI-Pentest-Benchmark Single Experiment
[Chart: AC Score by date. Top entry: Qwen3-32B-finetune (Ours), AC Score 46 (Sep 16, 2025). Metrics: AC Score, WS Score, NS Score, CRPT Score, Real-world Success Count, Overall Success Count (ALL).]
Updated 1mo ago
Evaluation Results
| Method | Details | Date | AC Score | WS Score | NS Score | CRPT Score | Real-world Success Count | Overall Success Count (ALL) |
|---|---|---|---|---|---|---|---|---|
| Qwen3-32B-finetune (Ours) | Backbone=Qwen3-32B, Fr... | 2025.09 | 46 | 38 | 15 | 38 | 114 | 251 |
| Llama3.1-405B (VulnBot) | Backbone=Llama3.1-405B... | 2025.09 | 31 | 30 | 11 | 18 | 55 | 145 |
| Qwen3-32B (Base) | Backbone=Qwen3-32B, Fr... | 2025.09 | 26 | 28 | 11 | 22 | 79 | 166 |
| Llama3.3-70B (VulnBot) | Backbone=Llama3.3-70B,... | 2025.09 | 25 | 24 | 12 | 15 | 49 | 125 |
| Llama3.1-405B (Base) | Backbone=Llama3.1-405B... | 2025.09 | 21 | 26 | 9 | 18 | 29 | 103 |
| Llama3.1-405B (PentestGPT) | Backbone=Llama3.1-405B... | 2025.09 | 20 | 18 | 6 | 12 | 28 | 84 |
| Llama3.3-70B (Base) | Backbone=Llama3.3-70B,... | 2025.09 | 16 | 22 | 10 | 17 | 29 | 94 |