
BigCodeBench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Code Generation | BigCodeBench | Accuracy 83.84 | 71 |
| Code Generation | BigCodeBench-Instruct Hard | Pass@1 28.4 | 48 |
| Code Generation | BigCodeBench-Instruct (Full) | Pass@1 0.504 | 48 |
| Code Generation | BigCodeBench-Completion Full | pass@1 59.7 | 41 |
| Code Generation | BigCodeBench Hard | Pass@1 35.1 | 38 |
| Code Generation | BigCodeBench Full | Pass@1 54.2 | 38 |
| Code Generation | BigCodeBench-Completion Hard | pass@1 36.5 | 38 |
| Code Safety Evaluation | BigCodeBench 1.0 (test) | Safety Score 99.9 | 24 |
| Code Completion | BigCodeBench Hard | Pass@1 16.2 | 20 |
| Code Completion | BigCodeBench Full | Pass@1 46.1 | 20 |
| Code Generation | BigCodeBench | pass@1 88.5 | 18 |
| Code Generation | BigCodeBench | pass@1 41.44 | 18 |
| Code Completion | BigCodeBench | Full Score 45.8 | 17 |
| Code Generation | BigCodeBench (BCB) 342 tasks 30% held-out (unseen) | Success Rate (SR) 55.8 | 15 |
| Code Generation | BigCodeBench instruct | Full Score 0.41 | 14 |
| Code Generation | BigCodeBench | avg@32 52.46 | 12 |
| Code Generation | BigCodeBench-I Hard | Score 28.4 | 11 |
| Code Generation | BigCodeBench-I Full | Score 50.4 | 11 |
| Code Generation | BigCodeBench Instruct Full (train) | Last SR 83.3 | 10 |
| Code Generation | BigCodeBench | tau 4.18 | 10 |
| Coding | BigCodeBench Full | Score 41.58 | 10 |
| Code Generation | BigCodeBench (hold-out) | Pass@1 35.8 | 8 |
| Sabotage detection | BigCodeBench Sabotage (reasoning LLM attacker) | log-AUROC 0.87 | 8 |
| Sabotage detection | BigCodeBench-Sabotage traditional LLM attacker | log-AUROC 0.84 | 8 |
| Quantization Detection | BigCodeBench | RUT 0.041 | 6 |
Showing 25 of 31 rows