
BigCodeBench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Code Generation | BigCodeBench | Accuracy 83.84 | 59 |
| Code Generation | BigCodeBench-Completion Full | pass@1 59.7 | 41 |
| Code Generation | BigCodeBench-Completion Hard | pass@1 36.5 | 38 |
| Code Generation | BigCodeBench-Instruct Hard | Pass@1 27 | 37 |
| Code Generation | BigCodeBench-Instruct (Full) | Pass@1 0.501 | 37 |
| Code Safety Evaluation | BigCodeBench 1.0 (test) | Safety Score 99.9 | 24 |
| Code Completion | BigCodeBench Hard | Pass@1 16.2 | 20 |
| Code Completion | BigCodeBench Full | Pass@1 46.1 | 20 |
| Code Generation | BigCodeBench | pass@1 41.44 | 18 |
| Code Completion | BigCodeBench | Full Score 45.8 | 17 |
| Code Generation | BigCodeBench instruct | Full Score 0.41 | 14 |
| Coding | BigCodeBench Full | Score 41.58 | 10 |
| Code Generation | BigCodeBench (hold-out) | Pass@1 35.8 | 8 |
| Sabotage detection | BigCodeBench Sabotage (reasoning LLM attacker) | log-AUROC 0.87 | 8 |
| Sabotage detection | BigCodeBench-Sabotage traditional LLM attacker | log-AUROC 0.84 | 8 |
| Verified Code Gen. | BigCodeBench (test) | Pass Rate 38.97 | 6 |
| Code Generation | BigCodeBench instruction split (test) | Pass Rate 37.86 | 6 |
| Code Generation | BigCodeBench | Last Epoch Success Rate 59.5 | 6 |
| Code Generation | BigCodeBench (val) | Success Rate 50.8 | 6 |
| Code Generation | BigCodeBench Full 1.0 | Pass@1 459 | 3 |
| Code Reasoning | BigCodeBench | BigCodeBench Score 40 | 3 |