
LLM-as-a-Judge Evaluation on FLASK

0.589 Pearson's r

Qwen3-32B REAL (ours)

[Chart: score over time, Mar 6, 2025 to Mar 17, 2026]
Updated 1mo ago

Evaluation Results

Date      Pearson's r   (unlabeled)   (unlabeled)
2026.03   0.589         0.586         0.474
2026.03   0.56          0.541         0.411
2026.03   0.543         0.604         0.472
2026.03   0.538         0.539         0.431
2026.03   0.525         0.514         0.392
2026.03   0.523         0.524         0.395
2026.03   0.521         0.529         0.399
2025.03   0.518         0.501         -
2026.03   0.516         0.505         0.379
2026.03   0.514         0.513         0.385
2025.03   0.512         0.493         -
2026.03   0.512         0.493         0.405
2025.03   0.509         0.502         -
2026.03   0.507         0.5           0.372
2025.03   0.506         0.493         -
2025.03   0.5           0.493         -
2026.03   0.492         0.501         0.375
2025.03   0.475         0.484         -
2025.03   0.468         0.436         -
2026.03   0.466         0.429         0.346
2026.03   0.457         0.457         0.365
2026.03   0.45          0.483         0.385
2025.03   0.448         0.437         -
2026.03   0.448         0.48          0.401
2025.03   0.435         0.433         -
2026.03   0.425         0.437         0.323
2025.03   0.418         0.419         -
2026.03   0.418         0.419         0.315
2026.03   0.415         0.419         0.341
2025.03   0.413         0.407         -
2025.03   0.412         0.445         -
2025.03   0.358         0.346         -
2025.03   0.355         0.361         -
2026.03   0.27          0.232         0.187
2025.03   0.228         0.168         -
2025.03   0.2           0.149         -
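
The headline metric on this page is Pearson's r. As a minimal sketch, assuming the metric is the correlation between scores assigned by a judge model and reference scores for the same responses (the page itself does not state the protocol), it could be computed as follows. The function name `pearson_r` and the score lists are hypothetical and made up for illustration; they are not taken from the leaderboard.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative scores on a 1-5 scale, NOT leaderboard data.
reference_scores = [1, 2, 3, 4, 5]
judge_scores = [2, 1, 4, 3, 5]

print(round(pearson_r(reference_scores, judge_scores), 3))  # prints 0.8
```

A value of 1 means the judge's scores move in perfect linear agreement with the reference scores, 0 means no linear relationship, and -1 means perfect inverse agreement.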