Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

Incorrect Reasoning Path Detection on MATH500

94Accuracy

TokUR (TU)

18.152837.843957.53577.2261May 16, 2025
Updated 5d ago

Evaluation Results

MethodLinks
2025.05
9484.6193.43
2025.05
9484.7493.65
2025.05
9281.8791.68
2025.05
87.3382.5588.3
2025.05
86.6781.4887.42
2025.05
85.669.5486.24
2025.05
84.5366.8683.85
2025.05
84.2778.9685.37
2025.05
82.1362.1980.19
2025.05
8262.5180.29
2025.05
81.3374.9584.76
2025.05
80.5374.3983.81
2025.05
76.868.9477.3
2025.05
76.1368.4276.82
2025.05
75.07--
2025.05
7482.4779.62
2025.05
7482.4379.56
2025.05
72.482.8681.35
2025.05
69.0776.4176.22
2025.05
66.2771.8669.57
2025.05
64.73--
2025.05
59.269.4263.74
2025.05
57.3362.9455.06
2025.05
55.7362.9355.21
2025.05
53.4758.6257.69
2025.05
53.0757.9849.72
2025.05
52.857.4149.69
2025.05
51.0755.3647.24
2025.05
49.650.2349.48
2025.05
48.6--
2025.05
44.6780.6456.79
2025.05
44.6780.6156.73
2025.05
44.1379.7456.64
2025.05
39.8771.7746
2025.05
38.1371.1748.37
2025.05
35.3333.4136.05
2025.05
31.3356.4127.01
2025.05
31.3357.0826.88
2025.05
30.9360.5736.32
2025.05
29.8755.4125.88
2025.05
29.255.7128.82
2025.05
27.654.3826.39
2025.05
25.6--
2025.05
25.248.7525.79
2025.05
24.1347.2925.71
2025.05
21.0744.5724.03