
CRUXEval

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Code Reasoning | CRUXEval-O | Accuracy: 83.5 | 61 |
| Code Reasoning | CRUXEval | Input-CoT Accuracy: 98.8 | 56 |
| Code Output Prediction | CRUXEval-O | Pass@1: 79.1 | 52 |
| Code Input Prediction | CRUXEval-I | Pass@1: 67.5 | 52 |
| Code Reasoning | CRUXEval | Accuracy: 76.87 | 36 |
| Coding | CRUXEval-O | Score: 87.5 | 19 |
| Code Reasoning | CruxEval Output | Score: 51 | 12 |
| Code Reasoning | CRUXEval I | Accuracy: 74 | 9 |
| Code-reasoning | CRUXEval-O | Pass@1: 41 | 8 |
| Coding | CRUXEval | Pass@1: 88.5 | 8 |
| Output Prediction | CruxEval | Pass@1: 94.13 | 6 |
| Reasoning failure prediction and recovery | CRUXEval L2 | Accuracy: 77 | 4 |
| Code | CruxEval o | Exact Match: 35.3 | 4 |
| Code | CruxEval-i | Exact Match: 36.2 | 4 |
| Input Prediction | CruxEval | Pass@1 (inv_step_call): 66.5 | 3 |
| Code Reasoning (Output Prediction) | CRUXEval-O 1-shot | Accuracy: 84.01 | 3 |
| Code Reasoning (Input Prediction) | CRUXEval-I 1-shot | Accuracy: 79.75 | 3 |
| Reasoning failure prediction and recovery | CRUXEval (L3) | Accuracy: 74 | 2 |
| Reasoning failure prediction and recovery | CRUXEval L1 | Accuracy: 89 | 2 |