| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Multi-turn Dialogue Evaluation | MT-Bench | Overall Score: 62.7 | 331 |
| Instruction Following | MT-Bench | MT-Bench Score: 9.32 | 189 |
| Multi-turn Dialogue Evaluation | MT-Bench-zh | Score: 6.34 | 90 |
| Instruction Following | MT-Bench zh | Score: 6.83 | 60 |
| Multi-turn dialogue | MT-bench | Kendall's Tau: 5.25 | 54 |
| Instruction Following | MT-bench v1.0 (test) | MT-Bench Score: 61.2 | 52 |
| Multi-turn Dialogue | MT-Bench | Speedup: 4.1 | 47 |
| Dialogue | MT-Bench (test) | GPT-4 Score: 8.36 | 46 |
| Judge Agreement | MT-bench Second Turn 1.0 | Agreement Rate: 95 | 46 |
| Multi-turn conversation | MT-Bench | Conversation Rating (1-10): 8.7 | 41 |
| Multi-round conversation | MT-Bench | Tokens Per Second: 274.32 | 40 |
| LLM Judge Agreement | MT-bench First Turn | Agreement Rate: 0.97 | 34 |
| Long-form Dialogue | MT-Bench+ | Quality Score: 90.5 | 32 |
| Judge Agreement | MT-bench Second Turn | Agreement: 95 | 32 |
| Chat | MT-Bench | MT-Bench Score: 8.1 | 30 |
| Instruction Following | MT-Bench (test) | Overall Score: 6.52 | 27 |
| Generative Inference | MT-Bench | Speedup: 2.73 | 26 |
| self-affirmation | MT-Bench-101 | Success Rate: 0.58 | 25 |
| Output Diversity | MT-Bench | Lexical Diversity Score: 48.36 | 20 |
| Response Quality Evaluation | MT-Bench | Average Response Quality: 8.71 | 19 |
| Chat | MT-Bench 1.0 (test) | MT-Bench Score: 8 | 19 |
| LLM Unlearning | MT-Bench | Fluency: 5.62 | 18 |
| Pairwise LLM Judging | MT-Bench | Coverage: 100 | 16 |
| Rerouting Detection | MT-Bench (test) | Accuracy: 100 | 16 |
| LLM-as-a-judge evaluation | MT-Bench | Pearson's r: 0.672 | 16 |