
Relative Robustness Analysis on Combined Past Tense, OR-Bench, MMLU

R-Score: 78.9

Only Scaling

Sep 30, 2025
Updated 4d ago

Evaluation Results

Method    Links
2025.09   78.9   46.4
2025.09   75.6   39.8
2025.09   74.6   58.8
2025.09   73.7   46.3
2025.09   72.8   41.9
2025.09   72.4   46.7
2025.09   71.8   52.9
2025.09   71.6   50.3
2025.09   70.5   37.2
2025.09   70.1   44.6
2025.09   69.8   35.4
2025.09   69.5   36.7
2025.09   68.6   44.3
2025.09   68     35
2025.09   67.9   36
2025.09   67.6   37.3
2025.09   67.1   35.5
2025.09   66.9   38.5
2025.09   66.9   35
2025.09   66.4   58.7
2025.09   66.3   35.7
2025.09   66     37
2025.09   65.7   48.4
2025.09   65.5   33.8
2025.09   65.5   35.8
2025.09   64.9   34.9
2025.09   64.5   32.8
2025.09   64.3   37.7
2025.09   61.9   33.9
2025.09   60.2   40.6
2025.09   58.6   48.3
2025.09   57.3   35.1
2025.09   56     47
2025.09   52.2   45.6
2025.09   30.6   36.3
2025.09   5.928.96