
Multistep Reasoning on MUSR (Accuracy)

61.67 Accuracy

Base

[Figure: MUSR accuracy over time; x-axis: date, May 27, 2025 to Feb 25, 2026; y-axis: accuracy]
Updated 9d ago

Evaluation Results

Date      Accuracy
2025.05   61.67
2026.02   60.8
2026.02   60.8
2026.02   53.5
2025.10   50.13
2025.10   49.47
2025.10   48.68
2025.10   46.16
2025.10   45.77
2025.10   42.86
2026.02   39.6
2026.02   39.4
2026.02   38.7
2026.02   38.7
2026.02   38.7
2026.02   38.7
2026.02   38.5
2026.02   38.5
2026.02   38.3
2026.02   38.3
2026.02   37.9
2026.02   37.7
2026.02   36.6
2025.05   32.31
2025.05   30.95
2025.05   30.71
2025.05   18.05
2025.05   17.86