
MT-Bench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Multi-turn Dialogue Evaluation | MT-Bench | Overall Score: 62.7 | 447 |
| Instruction Following | MT-Bench | MT-Bench Score: 9.32 | 215 |
| Multi-turn Dialogue Evaluation | MT-Bench-zh | Score: 6.34 | 90 |
| Multi-turn Dialogue | MT-Bench | Speedup: 4.1 | 80 |
| Instruction Following | MT-Bench zh | Score: 6.83 | 60 |
| Chat | MT-Bench | MT-Bench Score: 8.91 | 58 |
| Multi-turn dialogue | MT-bench | Kendall's Tau: 5.25 | 54 |
| Instruction Following | MT-bench v1.0 (test) | MT-Bench Score: 61.2 | 52 |
| Dialogue | MT-Bench (test) | GPT-4 Score: 8.36 | 46 |
| Judge Agreement | MT-bench Second Turn 1.0 | Agreement Rate: 95 | 46 |
| Multi-turn Instruction Following | MT Bench | MT-Bench Score (GPT-4): 9.16 | 44 |
| Generative Inference | MT-Bench | Speedup: 2.73 | 44 |
| Multi-turn conversation | MT-bench | SR: 5.23 | 43 |
| Multi-turn conversation | MT-Bench | Conversation Rating (1-10): 8.7 | 41 |
| Multi-round conversation | MT-Bench | Tokens Per Second: 274.32 | 40 |
| LLM-as-a-judge evaluation | MT-Bench | Pearson's r: 0.689 | 36 |
| LLM Judge Agreement | MT-bench First Turn | Agreement Rate: 0.97 | 34 |
| Long-form Dialogue | MT-Bench+ | Quality Score: 90.5 | 32 |
| Judge Agreement | MT-bench Second Turn | Agreement: 95 | 32 |
| Dialogue | MT-Bench | MT-Bench Score: 9.3 | 29 |
| Speculative Sampling | MT-Bench | Average Acceptance Length: 4.13 | 28 |
| Conversational Ability | MT-Bench | MT-Bench Score: 7.58 | 28 |
| Instruction Following | MT-Bench (test) | Overall Score: 6.52 | 27 |
| Code generation | MT-Bench (test) | Speedup Ratio: 3.934 | 26 |
| Multi-turn instruction following | MT-Bench High-Variance (Top 20%) | Reward Score: 7.54 | 26 |
Showing 25 of 90 rows