| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Code Generation | APPS | Pass@1 91.2 | 111 |
| Code Correctness Evaluation | APPS | Accuracy 80 | 53 |
| Code Generation | APPS (test) | Introductory Score 56.3 | 36 |
| Code Generation | APPS Intermediate | Pass Rate 81.95 | 32 |
| Code Safety Evaluation | APPS 1.0 (test) | Safety Score 0.988 | 30 |
| Code Generation | APPS | Accuracy 45 | 29 |
| Code Generation | APPS Introductory | pass@1 92.1 | 25 |
| Code Generation | APPS Competition | pass@1 38 | 20 |
| Code Generation | APPS Overall | PR 21.38 | 18 |
| Meta-reasoning quality assessment | APPS | Thoroughness 85.6 | 12 |
| Code Generation | APPS | Pass@4 83.2 | 12 |
| Program Synthesis | APPS 1.0 (test) | Pass@5 (Introductory) 25.61 | 11 |
| Code Generation | APPS | Tau 5.65 | 10 |
| Code Generation | APPS Interview | Pass@1 2.64 | 9 |
| Code Generation | APPS interview-level (test) | Mean Score 0.5717 | 8 |
| Watermark message recovery | APPS-G | Message Accuracy 100 | 8 |
| Code Peak-Memory Prediction | APPS | Correlation (rho) 0.96 | 7 |
| Competitive Programming | APPS (val) | Pass@1 72.72 | 6 |
| Monitoring | APPS (test) | pAUC 81.6 | 6 |
| Code metric regression | APPS Leetcode (test) | RMSE 0.474 | 6 |
| Code Generation | APPS+ | Pass@1 (Introductory) 1.94 | 5 |
| Code Generation | APPS+ Competition | Pass@1 2.67 | 5 |
| Coding Reasoning | Apps | Pass Rate 68.3 | 5 |
| Program Synthesis | APPS | Pass@5 (Introductory) 25.61 | 5 |
| Dafny Code Synthesis | APPS Vericoding-derived (test) | Pass Rate 83 | 4 |