LLM-as-a-judge evaluation on FB Bench (Feedback Bench)

0.949 Pearson's r

Qwen3-8B TRACT

[Chart: score over time, Mar 6, 2025 to Mar 17, 2026]
Updated 1mo ago

Evaluation Results

Date     Pearson's r  (metric 2)  (metric 3)
2026.03  0.949        0.947       0.843
2026.03  0.94         0.942       0.835
2026.03  0.939        0.937       0.829
2026.03  0.937        0.937       0.828
2025.03  0.932        0.93        -
2026.03  0.932        0.934       0.825
2025.03  0.931        0.93        -
2025.03  0.92         0.918       -
2025.03  0.92         0.917       -
2026.03  0.92         0.921       0.857
2025.03  0.919        0.917       -
2026.03  0.911        0.917       0.859
2025.03  0.89         0.891       -
2025.03  0.879        0.88        -
2026.03  0.879        0.88        0.763
2025.03  0.873        0.873       -
2025.03  0.872        0.872       -
2026.03  0.86         0.858       0.771
2025.03  0.857        0.857       -
2026.03  0.854        0.865       0.729
2026.03  0.853        0.853       0.729
2026.03  0.847        0.849       0.767
2025.03  0.845        0.847       -
2026.03  0.845        0.847       0.765
2026.03  0.843        0.855       0.73
2026.03  0.837        0.843       0.702
2025.03  0.835        0.834       -
2026.03  0.831        0.833       0.748
2025.03  0.683        0.689       -
2025.03  0.674        0.684       -
2026.03  0.634        0.708       0.567
2026.03  0.567        0.654       0.541
2026.03  0.566        0.627       0.539
2026.03  0.563        0.521       0.453
2025.03  0.381        0.376       -
2025.03  0.197        0.175       -
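
The headline metric on this page is Pearson's r. As a rough illustration only, the sketch below shows how such a correlation could be computed, assuming the metric compares a judge model's scores with reference scores for the same responses. The function name `pearson_r` and the sample score lists are hypothetical and are not taken from the benchmark.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical 1-5 scores: what a judge model assigned vs. the reference.
judge_scores = [1, 2, 3, 4, 5]
reference_scores = [1, 2, 3, 5, 4]

r = pearson_r(judge_scores, reference_scores)
print(round(r, 3))
```

In practice a library routine such as `statistics.correlation` (Python 3.10+) gives the same quantity; the explicit version is shown only to make the formula visible.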