| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Multi-turn Dialogue Evaluation | MT-Bench | Overall Score: 62.7 | 447 |
| Instruction Following | MT-Bench | MT-Bench Score: 9.32 | 215 |
| Multi-turn Dialogue Evaluation | MT-Bench-zh | Score: 6.34 | 90 |
| Multi-turn Dialogue | MT-Bench | Speedup: 4.1 | 80 |
| Instruction Following | MT-Bench zh | Score: 6.83 | 60 |
| Chat | MT-Bench | MT-Bench Score: 8.91 | 58 |
| Multi-turn dialogue | MT-bench | Kendall's Tau: 5.25 | 54 |
| Instruction Following | MT-bench v1.0 (test) | MT-Bench Score: 61.2 | 52 |
| Dialogue | MT-Bench (test) | GPT-4 Score: 8.36 | 46 |
| Judge Agreement | MT-bench Second Turn 1.0 | Agreement Rate: 95 | 46 |
| Multi-turn Instruction Following | MT Bench | MT-Bench Score (GPT-4): 9.16 | 44 |
| Generative Inference | MT-Bench | Speedup: 2.73 | 44 |
| Multi-turn conversation | MT-bench | SR: 5.23 | 43 |
| Multi-turn conversation | MT-Bench | Conversation Rating (1-10): 8.7 | 41 |
| Multi-round conversation | MT-Bench | Tokens Per Second: 274.32 | 40 |
| LLM-as-a-judge evaluation | MT-Bench | Pearson's r: 0.689 | 36 |
| LLM Judge Agreement | MT-bench First Turn | Agreement Rate: 0.97 | 34 |
| Long-form Dialogue | MT-Bench+ | Quality Score: 90.5 | 32 |
| Judge Agreement | MT-bench Second Turn | Agreement: 95 | 32 |
| Dialogue | MT-Bench | MT-Bench Score: 9.3 | 29 |
| Speculative Sampling | MT-Bench | Average Acceptance Length: 4.13 | 28 |
| Conversational Ability | MT-Bench | MT-Bench Score: 7.58 | 28 |
| Instruction Following | MT-Bench (test) | Overall Score: 6.52 | 27 |
| Code generation | MT-Bench (test) | Speedup Ratio: 3.934 | 26 |
| Multi-turn instruction following | MT-Bench High-Variance (Top 20%) | Reward Score: 7.54 | 26 |