MT-Bench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Multi-turn Dialogue Evaluation | MT-Bench | Overall Score: 62.7 | 532 |
| Instruction Following | MT-Bench | MT-Bench Score: 9.32 | 287 |
| Multi-turn Instruction Following | MT Bench | MT-Bench Score (GPT-4): 9.16 | 129 |
| Multi-turn dialogue | MT-Bench | MT-Bench Score: 76.19 | 126 |
| Multi-turn Conversation | MT-Bench | Average Score: 85.25 | 107 |
| Multi-turn Dialogue Evaluation | MT-Bench-zh | Score: 6.34 | 90 |
| Multi-turn Dialogue | MT-Bench | Speedup: 4.1 | 80 |
| Multi-turn Conversation | MT-bench | Speedup: 4.64 | 76 |
| Chat | MT-Bench | MT-Bench Score: 8.91 | 73 |
| Multi-turn Conversation Evaluation | MT-bench | MT-Bench Score: 81.1 | 68 |
| Dialogue Evaluation | MT-Bench | Kendall's Tau (τ): 4.01 | 62 |
| Instruction Following | MT-Bench zh | Score: 6.83 | 60 |
| Multi-turn dialogue | MT-bench | Kendall's Tau: 5.25 | 54 |
| Speculative Decoding | MT-bench | Tau (τ): 6.06 | 53 |
| Instruction Following | MT-bench v1.0 (test) | MT-Bench Score: 61.2 | 52 |
| Alignment | MT-Bench | MT-Bench Score: 9.12 | 49 |
| Dialogue | MT-Bench (test) | GPT-4 Score: 8.36 | 46 |
| Judge Agreement | MT-bench Second Turn 1.0 | Agreement Rate: 95 | 46 |
| Multi-turn Dialogue | MT-Bench | Speedup: 3.22 | 44 |
| LLM-as-a-Judge | MT-Bench | Accuracy: 81.4 | 44 |
| Generative Inference | MT-Bench | Speedup: 2.73 | 44 |
| Multi-turn conversation | MT-bench | SR: 5.23 | 43 |
| Dialogue | MT-Bench | MT-Bench Score: 9.3 | 41 |
| Multi-turn conversation | MT-Bench | Conversation Rating (1-10): 8.7 | 41 |
| Multi-round conversation | MT-Bench | Tokens Per Second: 274.32 | 40 |
Showing 25 of 130 rows