Pairwise Evaluation on MT-Bench

Human Agreement Rate: 83.69

Fine-tuned Rubric Generator

[Chart: Human Agreement Rate over time; y-axis ticks 80.4244, 81.2722, 82.12, 82.9678; x-axis label May 28, 2026]
Updated 2d ago

Evaluation Results

Date     Human Agreement Rate
2026.05  83.69
2026.05  83.35
2026.05  82.93
2026.05  82.87
2026.05  82.72
2026.05  82.62
2026.05  81.72
2026.05  80.98
2026.05  80.55