MT-Bench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Multi-turn Dialogue Evaluation | MT-Bench | Overall Score: 62.7 | 331 |
| Instruction Following | MT-Bench | MT-Bench Score: 9.32 | 189 |
| Multi-turn Dialogue Evaluation | MT-Bench-zh | Score: 6.34 | 90 |
| Instruction Following | MT-Bench zh | Score: 6.83 | 60 |
| Multi-turn dialogue | MT-bench | Kendall's Tau: 5.25 | 54 |
| Instruction Following | MT-bench v1.0 (test) | MT-Bench Score: 61.2 | 52 |
| Multi-turn Dialogue | MT-Bench | Speedup: 4.1 | 47 |
| Dialogue | MT-Bench (test) | GPT-4 Score: 8.36 | 46 |
| Judge Agreement | MT-bench Second Turn 1.0 | Agreement Rate: 95 | 46 |
| Multi-turn conversation | MT-Bench | Conversation Rating (1-10): 8.7 | 41 |
| Multi-round conversation | MT-Bench | Tokens Per Second: 274.32 | 40 |
| LLM Judge Agreement | MT-bench First Turn | Agreement Rate: 0.97 | 34 |
| Long-form Dialogue | MT-Bench+ | Quality Score: 90.5 | 32 |
| Judge Agreement | MT-bench Second Turn | Agreement: 95 | 32 |
| Chat | MT-Bench | MT-Bench Score: 8.1 | 30 |
| Instruction Following | MT-Bench (test) | Overall Score: 6.52 | 27 |
| Generative Inference | MT-Bench | Speedup: 2.73 | 26 |
| self-affirmation | MT-Bench-101 | Success Rate: 0.58 | 25 |
| Output Diversity | MT-Bench | Lexical Diversity Score: 48.36 | 20 |
| Response Quality Evaluation | MT-Bench | Average Response Quality: 8.71 | 19 |
| Chat | MT-Bench 1.0 (test) | MT-Bench Score: 8 | 19 |
| LLM Unlearning | MT-Bench | Fluency: 5.62 | 18 |
| Pairwise LLM Judging | MT-Bench | Coverage: 100 | 16 |
| Rerouting Detection | MT-Bench (test) | Accuracy: 100 | 16 |
| LLM-as-a-judge evaluation | MT-Bench | Pearson's r: 0.672 | 16 |
Showing 25 of 62 rows