Pairwise Evaluation on MT-Bench
[Chart: Human Agreement Rate. Top result: 83.69 (Fine-tuned Rubric Generator), May 28, 2026.]
Evaluation Results
Method | Setting | Date | Human Agreement Rate
Fine-tuned Rubric Generator | Judge=Claude Sonnet 4,... | 2026.05 | 83.69
Fine-tuned Rubric Generator | Judge=Claude Sonnet 4,... | 2026.05 | 83.35
Fine-tuned Rubric Generator | Judge=Llama 3.1 70B, I... | 2026.05 | 82.93
Fine-tuned Rubric Generator | Judge=Qwen3 14B, Itera... | 2026.05 | 82.87
Fine-tuned Rubric Generator | Judge=Llama 3.1 70B, I... | 2026.05 | 82.72
Fine-tuned Rubric Generator | Judge=Qwen3 14B, Itera... | 2026.05 | 82.62
Highest Training-free | Judge=Claude Sonnet 4 | 2026.05 | 81.72
Highest Training-free | Judge=Llama 3.1 70B | 2026.05 | 80.98
Highest Training-free | Judge=Qwen3 14B | 2026.05 | 80.55