| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Code Generation | HumanEval | Pass@1 7,927 | 850 |
| Code Generation | HumanEval (test) | Pass@1 100 | 444 |
| Code Generation | HumanEval+ | Pass@1 92.7 | 189 |
| Code Generation | HumanEval 1.0 (test) | Pass@1 85.4 | 145 |
| Code Generation | HumanEval | Pass@1 94.1 | 108 |
| Code Generation | HumanEval | Accuracy (%) 63.8 | 77 |
| Code Generation | HumanEval-ET | Pass@1 89.6 | 75 |
| Code Generation | HumanEval | Tokens/s 287.77 | 61 |
| Inference Efficiency | HumanEval | Speedup Factor 5.15 | 54 |
| Coding | HumanEval | Pass@1 98.17 | 52 |
| Code Generation | HumanEval Multilingual (test) | Average Score 76.5 | 52 |
| Code Generation | HumanEval | Accuracy 81.7 | 51 |
| Code | HumanEval | HumanEval Accuracy 93.4 | 50 |
| Code Generation | HumanEval | HumanEval Score 93 | 50 |
| Function-level Code Generation | HumanEval+ augmented (test) | Pass@1 90 | 46 |
| Code Generation | HumanEval | Average Tau (τ) 1.94 | 45 |
| Code Generation | HumanEval @WizardCoder (test) | Pass@1 71.95 | 45 |
| Code Debugging | HumanEval | Accuracy 96.3 | 42 |
| Code Generation | HumanEval | TPS 222.68 | 41 |
| Code Reasoning | HumanEval | HumanEval Score 95.73 | 35 |
| Code Completion | HumanEval+ | Pass@1 56.7 | 33 |
| Code Verification | HumanEval+ | Pass@1 87.05 | 32 |
| Coding | HumanEval+ | Pass@1 95.12 | 31 |
| Code Generation | HumanEval OOD | Pass@1 32.31 | 30 |
| Code Generation | HumanEval | Functional Score M 16.47 | 29 |