| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Code Reasoning | CRUXEval-O | Accuracy 83.5 | 61 | |
| Code Reasoning | CRUXEval | Input-CoT Accuracy 98.8 | 56 | |
| Code Output Prediction | CRUXEval-O | Pass@1 79.1 | 52 | |
| Code Input Prediction | CRUXEval-I | Pass@1 67.5 | 52 | |
| Code Reasoning | CRUXEval | Accuracy 76.87 | 36 | |
| Coding | CRUXEval-O | Score 87.5 | 19 | |
| Code Reasoning | CruxEval Output | Score 51 | 12 | |
| Code Reasoning | CRUXEval I | Accuracy 74 | 9 | |
| Code-reasoning | CRUXEval-O | Pass@1 41 | 8 | |
| Coding | CRUXEval | Pass@1 88.5 | 8 | |
| Output Prediction | CruxEval | Pass@1 94.13 | 6 | |
| Reasoning failure prediction and recovery | CRUXEval L2 | Accuracy 77 | 4 | |
| Code | CruxEval o | Exact Match 35.3 | 4 | |
| Code | CruxEval-i | Exact Match 36.2 | 4 | |
| Input Prediction | CruxEval | Pass@1 (inv_step_call) 66.5 | 3 | |
| Code Reasoning (Output Prediction) | CRUXEval-O 1-shot | Accuracy 84.01 | 3 | |
| Code Reasoning (Input Prediction) | CRUXEval-I 1-shot | Accuracy 79.75 | 3 | |
| Reasoning failure prediction and recovery | CRUXEval (L3) | Accuracy 74 | 2 | |
| Reasoning failure prediction and recovery | CRUXEval L1 | Accuracy 89 | 2 | |