LLM as a Judge on PRISM (test)
Top result: SynthesizeMe, 58.9 accuracy (Jun 5, 2025)
Evaluation Results
Method             Configuration               Date      Accuracy
-----------------  --------------------------  --------  --------
SynthesizeMe       Backbone=GPT4o-mini         2025.06   58.9
SynthesizeMe       Backbone=Gemini-2.5-Flash   2025.06   58.36
SynthesizeMe       Backbone=Gemini-2.0-Flash   2025.06   57.8
SynthesizeMe       Backbone=Gemini-2.5-Pro     2025.06   57.76
SynthesizeMe       Backbone=Qwen2-30B-A3B      2025.06   57.37
Gemini-2.0-Flash   Prompting=Default           2025.06   56.97
SynthesizeMe       Backbone=Qwen2-32B          2025.06   56.74
Gemini-2.5-Flash   Prompting=Default           2025.06   56.66
Gemini-2.5-Pro     Prompting=Default           2025.06   56.51
Qwen2-30B-A3B      Prompting=Default           2025.06   56.32
Qwen2-32B          Prompting=Default           2025.06   56.22
GPT4o-mini         Prompting=Default           2025.06   56.07
SynthesizeMe       Backbone=Qwen2-8B           2025.06   55.95
Qwen2-8B           Prompting=Default           2025.06   55.14