| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Code Generation | BigCodeBench | Accuracy 83.84 | 71 |
| Code Generation | BigCodeBench-Instruct Hard | Pass@1 28.4 | 48 |
| Code Generation | BigCodeBench-Instruct (Full) | Pass@1 0.504 | 48 |
| Code Generation | BigCodeBench-Completion Full | pass@1 59.7 | 41 |
| Code Generation | BigCodeBench Hard | Pass@1 35.1 | 38 |
| Code Generation | BigCodeBench Full | Pass@1 54.2 | 38 |
| Code Generation | BigCodeBench-Completion Hard | pass@1 36.5 | 38 |
| Code Safety Evaluation | BigCodeBench 1.0 (test) | Safety Score 99.9 | 24 |
| Code Completion | BigCodeBench Hard | Pass@1 16.2 | 20 |
| Code Completion | BigCodeBench Full | Pass@1 46.1 | 20 |
| Code Generation | BigCodeBench | pass@1 88.5 | 18 |
| Code Generation | BigCodeBench | pass@1 41.44 | 18 |
| Code Completion | BigCodeBench | Full Score 45.8 | 17 |
| Code Generation | BigCodeBench (BCB) 342 tasks 30% held-out (unseen) | Success Rate (SR) 55.8 | 15 |
| Code Generation | BigCodeBench instruct | Full Score 0.41 | 14 |
| Code Generation | BigCodeBench | avg@32 52.46 | 12 |
| Code Generation | BigCodeBench-I Hard | Score 28.4 | 11 |
| Code Generation | BigCodeBench-I Full | Score 50.4 | 11 |
| Code Generation | BigCodeBench Instruct Full (train) | Last SR 83.3 | 10 |
| Code Generation | BigCodeBench | tau 4.18 | 10 |
| Coding | BigCodeBench Full | Score 41.58 | 10 |
| Code Generation | BigCodeBench (hold-out) | Pass@1 35.8 | 8 |
| Sabotage detection | BigCodeBench Sabotage (reasoning LLM attacker) | log-AUROC 0.87 | 8 |
| Sabotage detection | BigCodeBench-Sabotage traditional LLM attacker | log-AUROC 0.84 | 8 |
| Quantization Detection | BigCodeBench | RUT 0.041 | 6 |