LLM as a Judge on Chatbot Arena (test)
[Chart: Accuracy over time. Best result: Gemini-2.5-Pro, 68.13 Accuracy (Jun 5, 2025).]
Evaluation Results
Method              Configuration                Date      Accuracy
Gemini-2.5-Pro      Prompting=Default            2025.06   68.13
Gemini-2.5-Flash    Prompting=Default            2025.06   67.25
SynthesizeMe        Backbone=Gemini-2.5-Flash    2025.06   66.73
SynthesizeMe        Backbone=Gemini-2.5-Pro      2025.06   66.37
SynthesizeMe        Backbone=Qwen2-32B           2025.06   64.68
SynthesizeMe        Backbone=Gemini-2.0-Flash    2025.06   64.61
SynthesizeMe        Backbone=Qwen2-30B-A3B       2025.06   63.91
Gemini-2.0-Flash    Prompting=Default            2025.06   63.2
Qwen2-32B           Prompting=Default            2025.06   62.22
SynthesizeMe        Backbone=Qwen2-8B            2025.06   61.83
SynthesizeMe        Backbone=GPT4o-mini          2025.06   61.8
Qwen2-8B            Prompting=Default            2025.06   61.41
Qwen2-30B-A3B       Prompting=Default            2025.06   60.74
GPT4o-mini          Prompting=Default            2025.06   59.86