| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| BIGGEN | | Human Agreement | 78.33 | 41 | 2d ago |
| AlpacaEval | Fine-tuned Rubric Generator | Human Agreement | 72.4 | 37 | 2d ago |
| MT-Bench | Fine-tuned Rubric Generator | Human Agreement Rate | 83.69 | 9 | 2d ago |
| HH-RLHF (test) | pairwise evaluator | Test Accuracy | 95.2 | 4 | 1mo ago |