
CRUXEval

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Code Reasoning | CRUXEval | Input-CoT Accuracy | 98.8 | 56 |
| Code Output Prediction | CRUXEval-O | Pass@1 | 49.8 | 47 |
| Code Input Prediction | CRUXEval-I | Pass@1 | 50 | 47 |
| Code Reasoning | CRUXEval-O | Accuracy | 83.5 | 28 |
| Code Reasoning | CRUXEval | Accuracy | 68.6 | 21 |
| Code Reasoning | CruxEval Output | Score | 51 | 12 |
| Coding | CRUXEval-O | Score | 87.5 | 10 |
| Coding | CRUXEval | Pass@1 | 55.9 | 6 |
| Code Reasoning | CRUXEval I | Accuracy | 74 | 4 |
| Code | CruxEval o | Exact Match | 35.3 | 4 |
| Code | CruxEval-i | Exact Match | 36.2 | 4 |
| Output Prediction | CruxEval | Pass@1 (step_return) | 77.9 | 3 |
| Input Prediction | CruxEval | Pass@1 (inv_step_call) | 66.5 | 3 |
| Code Reasoning (Output Prediction) | CRUXEval-O 1-shot | Accuracy | 84.01 | 3 |
| Code Reasoning (Input Prediction) | CRUXEval-I 1-shot | Accuracy | 79.75 | 3 |
Showing 15 of 15 rows