| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Code Generation | BigCodeBench | Accuracy 83.84 | 73 |
| Code Generation | BigCodeBench-Instruct Hard | Pass@1 28.4 | 48 |
| Code Generation | BigCodeBench-Instruct (Full) | Pass@1 0.504 | 48 |
| Code Generation | BigCodeBench-Completion Full | pass@1 59.7 | 41 |
| Code Generation | BigCodeBench Hard | Pass@1 35.1 | 38 |
| Code Generation | BigCodeBench Full | Pass@1 54.2 | 38 |
| Code Generation | BigCodeBench-Completion Hard | pass@1 36.5 | 38 |
| Code Safety Evaluation | BigCodeBench 1.0 (test) | Safety Score 99.9 | 24 |
| Code Evaluation | BigCodeBench | Accuracy 82.02 | 23 |
| Code Generation | BigCodeBench Lite-Pro Compositional Stream | Accuracy 66.7 | 20 |
| Code Generation | BigCodeBench | Mean Accuracy 46.7 | 20 |
| Code Completion | BigCodeBench Hard | Pass@1 16.2 | 20 |
| Code Completion | BigCodeBench Full | Pass@1 46.1 | 20 |
| Code Generation | BigCodeBench | pass@1 88.5 | 18 |
| Code Generation | BigCodeBench | pass@1 41.44 | 18 |
| Code Completion | BigCodeBench | Full Score 45.8 | 17 |
| Code Generation | BigCodeBench Lite-Pro Naive Stream | Accuracy 44.8 | 16 |
| Coding | BigCodeBench Hard | Pass@1 33.8 | 15 |
| Coding | BigCodeBench Full | pass@1 54 | 15 |
| Code Generation | BigCodeBench (BCB) 342 tasks 30% held-out (unseen) | Success Rate (SR) 55.8 | 15 |
| Code Generation | BigCodeBench instruct | Full Score 0.41 | 14 |
| Code Generation | BigCodeBench | avg@32 52.46 | 12 |
| Skill retrieval | BigCodeBench | Recall@1 28.2 | 11 |
| Skill retrieval | BigCodeBench | nDCG@1 73.2 | 11 |
| Code Generation | BigCodeBench-I Hard | Score 28.4 | 11 |