
Correlation Metrics on MT-Bench (LLM-as-a-Judge)

Pearson's r: 0.689

Qwen3-32B REAL (ours)

[Chart: score over time; x-axis from Mar 6, 2025 to Mar 17, 2026, y-axis ticks from 0.2626 to 0.5947]
Updated 1mo ago

Evaluation Results

Date      Scores
2026.03   0.689   0.691   0.552
2025.03   0.672   0.639   -
2025.03   0.618   0.614   -
2026.03   0.617   0.608   0.471
2026.03   0.611   0.596   0.439
2026.03   0.593   0.569   0.422
2026.03   0.558   0.586   0.436
2025.03   0.555   0.529   -
2025.03   0.547   0.583   -
2025.03   0.541   0.556   -
2026.03   0.541   0.517   0.385
2026.03   0.538   0.511   0.389
2026.03   0.529   0.507   0.371
2026.03   0.521   0.501   0.366
2025.03   0.519   0.483   -
2026.03   0.519   0.483   0.392
2025.03   0.517   0.503   -
2025.03   0.511   0.506   -
2026.03   0.495   0.535   0.397
2025.03   0.483   0.469   -
2025.03   0.48    0.482   -
2026.03   0.467   0.455   0.345
2025.03   0.466   0.494   -
2025.03   0.435   0.426   -
2025.03   0.432   0.421   -
2026.03   0.425   0.468   0.353
2026.03   0.422   0.371   0.299
2025.03   0.399   0.418   -
2026.03   0.399   0.418   0.307
2026.03   0.367   0.371   0.285
2026.03   0.359   0.355   0.27
2026.03   0.355   0.327   0.265
2026.03   0.32    0.299   0.214
2025.03   0.309   0.216   -
2026.03   0.309   0.318   0.253
2025.03   0.279   0.268   -
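
The headline metric on this page is Pearson's r. As a rough illustration of what that number measures, here is a minimal sketch of computing it for two paired score lists. The function name `pearson_r` and the sample scores are hypothetical, chosen only for the example; they are not taken from the benchmark or from any listed method.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical paired scores (illustrative only, not benchmark data).
judge_scores = [7.0, 8.5, 6.0, 9.0, 4.5]
reference_scores = [6.5, 8.0, 6.5, 9.5, 5.0]
print(f"Pearson's r = {pearson_r(judge_scores, reference_scores):.3f}")
```

A value near 1 means the two sets of scores rise and fall together almost linearly; a value near 0 means little linear relationship.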