| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Multi-turn Dialogue Evaluation | MT-Bench | Overall Score: 62.7 | 532 |
| Instruction Following | MT-Bench | MT-Bench Score: 9.32 | 287 |
| Multi-turn Instruction Following | MT Bench | MT-Bench Score (GPT-4): 9.16 | 129 |
| Multi-turn dialogue | MT-Bench | MT-Bench Score: 76.19 | 126 |
| Multi-turn Conversation | MT-Bench | Average Score: 85.25 | 107 |
| Multi-turn Dialogue Evaluation | MT-Bench-zh | Score: 6.34 | 90 |
| Multi-turn Dialogue | MT-Bench | Speedup: 4.1 | 80 |
| Multi-turn Conversation | MT-bench | Speedup: 4.64 | 76 |
| Chat | MT-Bench | MT-Bench Score: 8.91 | 73 |
| Multi-turn Conversation Evaluation | MT-bench | MT-Bench Score: 81.1 | 68 |
| Dialogue Evaluation | MT-Bench | Kendall's Tau (τ): 4.01 | 62 |
| Instruction Following | MT-Bench zh | Score: 6.83 | 60 |
| Multi-turn dialogue | MT-bench | Kendall's Tau: 5.25 | 54 |
| Speculative Decoding | MT-bench | Tau (τ): 6.06 | 53 |
| Instruction Following | MT-bench v1.0 (test) | MT-Bench Score: 61.2 | 52 |
| Alignment | MT-Bench | MT-Bench Score: 9.12 | 49 |
| Dialogue | MT-Bench (test) | GPT-4 Score: 8.36 | 46 |
| Judge Agreement | MT-bench Second Turn 1.0 | Agreement Rate: 95 | 46 |
| Multi-turn Dialogue | MT-Bench | Speedup: 3.22 | 44 |
| LLM-as-a-Judge | MT-Bench | Accuracy: 81.4 | 44 |
| Generative Inference | MT-Bench | Speedup: 2.73 | 44 |
| Multi-turn conversation | MT-bench | SR: 5.23 | 43 |
| Dialogue | MT-Bench | MT-Bench Score: 9.3 | 41 |
| Multi-turn conversation | MT-Bench | Conversation Rating (1-10): 8.7 | 41 |
| Multi-round conversation | MT-Bench | Tokens Per Second: 274.32 | 40 |