LLM-as-a-Judge on DDI (test)
[Chart: EM over time. Best result: GPT-4o-Mini, EM 59.03, Jun 1, 2025. Metrics available: EM, RMSE.]
Updated 4d ago
Evaluation Results

| Method | Links | Date | EM (higher is better) | RMSE (lower is better) |
|---|---|---|---|---|
| GPT-4o-Mini | LLM-Generator Response... | 2025.06 | 59.03 | 1.84 |
| Gemini-Flash | LLM-Generator Response... | 2025.06 | 47.12 | 2.11 |
| Qwen-2.5-7B-Instruct | LLM-Generator Response... | 2025.06 | 46.6 | 2.15 |
| Phi-3.5-Mini-3.8B-Instruct | LLM-Generator Response... | 2025.06 | 43.06 | 2.19 |
| Deepseek-R1-Qwen-7B | LLM-Generator Response... | 2025.06 | 42.67 | 3.07 |
| Deepseek-R1-LLaMA-8B | LLM-Generator Response... | 2025.06 | 42.15 | 4.16 |
| Claude-3-Haiku | LLM-Generator Response... | 2025.06 | 31.15 | 2.7 |
| LLaMA-3.1-8B-Instruct | LLM-Generator Response... | 2025.06 | 29.32 | 2.95 |
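For readers unfamiliar with the two metrics, the sketch below shows how EM and RMSE are commonly computed when a judge model's scores are compared against reference scores. It is an illustration only: the page does not describe the benchmark's scoring protocol, so the assumption that scores are numeric ratings compared item by item, the function names `exact_match` and `rmse`, and the sample data are all hypothetical.

```python
import math


def exact_match(predicted, reference):
    """Percentage of items where the judge score equals the reference score."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * hits / len(predicted)


def rmse(predicted, reference):
    """Root mean squared error between judge scores and reference scores."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical data: four items rated by a judge model and by a reference.
judge_scores = [3, 5, 2, 4]
reference_scores = [3, 4, 2, 1]

em_value = exact_match(judge_scores, reference_scores)
rmse_value = rmse(judge_scores, reference_scores)
print(f"EM: {em_value:.2f}  RMSE: {rmse_value:.2f}")
```

Under this reading, EM rewards only exact agreement, while RMSE also reflects how far off the disagreements are, which is why a method can rank differently on the two columns.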