| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Code Generation | BigCodeBench | Accuracy 83.84 | 59 |
| Code Generation | BigCodeBench-Completion Full | pass@1 59.7 | 41 |
| Code Generation | BigCodeBench-Completion Hard | pass@1 36.5 | 38 |
| Code Generation | BigCodeBench-Instruct Hard | Pass@1 27 | 37 |
| Code Generation | BigCodeBench-Instruct (Full) | Pass@1 0.501 | 37 |
| Code Safety Evaluation | BigCodeBench 1.0 (test) | Safety Score 99.9 | 24 |
| Code Completion | BigCodeBench Hard | Pass@1 16.2 | 20 |
| Code Completion | BigCodeBench Full | Pass@1 46.1 | 20 |
| Code Generation | BigCodeBench | pass@1 41.44 | 18 |
| Code Completion | BigCodeBench | Full Score 45.8 | 17 |
| Code Generation | BigCodeBench instruct | Full Score 0.41 | 14 |
| Coding | BigCodeBench Full | Score 41.58 | 10 |
| Code Generation | BigCodeBench (hold-out) | Pass@1 35.8 | 8 |
| Sabotage detection | BigCodeBench Sabotage (reasoning LLM attacker) | log-AUROC 0.87 | 8 |
| Sabotage detection | BigCodeBench-Sabotage traditional LLM attacker | log-AUROC 0.84 | 8 |
| Verified Code Gen. | BigCodeBench (test) | Pass Rate 38.97 | 6 |
| Code Generation | BigCodeBench instruction split (test) | Pass Rate 37.86 | 6 |
| Code Generation | BigCodeBench | Last Epoch Success Rate 59.5 | 6 |
| Code Generation | BigCodeBench (val) | Success Rate 50.8 | 6 |
| Code Generation | BigCodeBench Full 1.0 | Pass@1 459 | 3 |
| Code Reasoning | BigCodeBench | BigCodeBench Score 40 | 3 |