
Incorrect Reasoning Path Detection on GSM8K

Accuracy: 98.31

DeepConf

May 16, 2025
Updated 5d ago

Evaluation Results

Method    Links
2025.05    98.31    81.94    97.88
2025.05    97.79    80.7     97.44
2025.05    97.69    80.92    97.81
2025.05    97.69    80.15    97.36
2025.05    97.23    77.4     97.35
2025.05    97.23    74.95    96.61
2025.05    97.08    75.37    97.02
2025.05    96.56    83.3     96.23
2025.05    96.26    80.6     95.65
2025.05    96.15    78.81    95.76
2025.05    96.05    77.66    95.46
2025.05    95.69    78.06    95.63
2025.05    95.54    81.01    95.53
2025.05    95.49    80.97    95.52
2025.05    94.92    75.54    94.93
2025.05    94.77    77.22    95.66
2025.05    94.67    78.31    94.91
2025.05    94.46    76.44    95.06
2025.05    94.26    70.75    93.6
2025.05    93.23    73.98    93.37
2025.05    93.23    74.03    93.37
2025.05    92.62    67.22    92.24
2025.05    92.46    72.21    92.64
2025.05    92.15    -        -
2025.05    88.21    58.86    87.44
2025.05    87.99    60.16    89.24
2025.05    87.72    -        -
2025.05    86.77    55.61    87.16
2025.05    85.69    -        -
2025.05    84.87    47.47    84.69
2025.05    82.77    41.94    82.19
2025.05    71.99    66.6     75.72
2025.05    62.77    75.7     69.72
2025.05    62.31    75.07    70.29
2025.05    62.21    75.03    70.22
2025.05    61.38    73.41    68.38
2025.05    59.85    71.21    61.61
2025.05    59.74    71.79    66.4
2025.05    59.62    49.05    60.02
2025.05    59.54    71.01    61.29
2025.05    57.38    69.01    58.51
2025.05    48.92    56.64    48.22
2025.05    45.79    53.66    46.03
2025.05    44.43    -        -
2025.05    43.95    50.28    43.24
2025.05    42.62    50.64    45.09