| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Code Generation | Coding Eval+ LiveCode (test) | Eval+ Score | 87.2 | 32 |
| Response correctness and completeness evaluation | Coding | F1 Score | 85 | 32 |
| Multi-Agent System Performance | Coding | TS Score | 65 | 16 |
| Coding | Coding (val) | Pass@16 | 100 | 16 |
| Reasoning | Coding | Normalized Score | 102.1 | 9 |
| Prompt Injection Detection | Coding Direct Prompt Injection | FPR | 0 | 7 |
| Code Generation | Coding Gender (test) | Cor (%) | 40 | 5 |
| Code Generation | Coding Race (test) | Correctness Rate | 57 | 5 |
| Agentic Coding | Coding unseen tasks (test) | Pass@1 | 29.2 | 3 |
| Coding | Coding Hard | Baseline Score | 36.67 | 1 |
| Coding | Coding Medium | Baseline Score | 31.96 | 1 |
| Coding | Coding Easy | Baseline | 51.14 | 1 |