Pair-wise comparison on MTBench Human
[Chart: Accuracy over time; top score 88.9 (CCE@16), Feb 18, 2025. Updated 4d ago.]
Evaluation Results
| Method        | Model                  | Date    | Accuracy |
|---------------|------------------------|---------|----------|
| CCE@16        | Qwen 2.5 72B-Ins...    | 2025.02 | 88.9     |
| CCE@16        | GPT-4o                 | 2025.02 | 83.6     |
| CCE@16        | Llama 3.3 70B-In...    | 2025.02 | 83.5     |
| CCE-random@16 | GPT-4o                 | 2025.02 | 83.1     |
| 16-Criteria   | GPT-4o                 | 2025.02 | 82.8     |
| Agg@16        | GPT-4o                 | 2025.02 | 82.6     |
| Maj@16        | GPT-4o                 | 2025.02 | 82.4     |
| Vanilla       | GPT-4o                 | 2025.02 | 82.1     |
| CCE@16        | Qwen 2.5 32B-Ins...    | 2025.02 | 82.1     |
| LongPrompt    | GPT-4o                 | 2025.02 | 81.8     |
| EvalPlan      | GPT-4o                 | 2025.02 | 81.4     |
| Vanilla       | Llama 3.3 70B-In...    | 2025.02 | 81.1     |
| Vanilla       | Qwen 2.5 72B-Ins...    | 2025.02 | 79.5     |
| Vanilla       | Qwen 2.5 32B-Ins...    | 2025.02 | 79.0     |
| CCE@16        | Qwen 2.5 7B-Inst...    | 2025.02 | 76.7     |
| Vanilla       | Qwen 2.5 7B-Inst...    | 2025.02 | 76.1     |