Accuracy and Consistency Metrics on MT-Bench (LLM-as-a-Judge)

Accuracy: 81.4

PA-GRPO

[Chart: Accuracy over time, Oct 9, 2025 to Mar 22, 2026]
Updated 1mo ago

Evaluation Results

Date      Accuracy
2026.03   81.4       91.6   74.8
2026.03   81         90.6   73.7
2026.03   79.4       86.5   72.8
2026.03   78.8       86.4   72
2026.03   78.8       79.1   71.2
2026.03   78.5       83.4   70.3
2026.03   78.1       82.5   69.3
2026.03   77.6       88     71.7
2026.03   77.2       86.1   72.1
2026.03   76.5       85.6   71.1
2026.03   76.1       84.6   70.4
2026.03   75.7       80.6   65.4
2026.03   75.1       83.2   68.9
2026.03   72.6       66.7   52.1
2026.03   72.4       62.1   50.9
2026.03   71.2       56.3   49.4
2026.03   68.5       59.4   48.2
2026.03   67         52.7   43.4
2026.03   65.6       46.2   38.9
2026.03   62.3       42.1   33.4
2026.03   59.6       25.2   22.2
2025.10   58.42      -      -
2025.10   56.99      -      -
2025.10   56.42      -      -
2025.10   56.13      -      -
2025.10   55.44      -      -
2025.10   55.44      -      -
2025.10   55.12      -      -
2025.10   54.96      -      -
2025.10   53         -      -
2025.10   52.16      -      -
2025.10   51.27      -      -
2025.10   50.52      -      -
2025.10   50.31      -      -
2025.10   49.54      -      -
2025.10   49.51      -      -
2025.10   48.26      -      -
2025.10   46.26      -      -
2025.10   45.66      -      -
2025.10   44.62      -      -
2025.10   43.4       -      -
2025.10   42.4       -      -
2025.10   40.86      -      -
2025.10   31.43      -      -