Completeness on ALCE
[Chart: Kendall's Tau, with Std Dev (ALCE). Highlighted point: Jury-on-Demand, 0.47, Dec 1, 2025.]
Updated 4d ago
Evaluation Results
Method            Details                     Date     Kendall's Tau  Std Dev (ALCE)
Jury-on-Demand    -                           2025.12  0.47           0.07
GPT-OSS-20B       Judge Model=GPT-OSS-20B     2025.12  0.4            0.07
Claude 3.7        Judge Model=Claude 3.7      2025.12  0.38           0.09
GPT-OSS-120B      Judge Model=GPT-OSS-120B    2025.12  0.36           0.07
Gemini 2.0 Flash  Judge Model=Gemini 2.0...   2025.12  0.34           0.1
Gemini 2.5 Pro    Judge Model=Gemini 2.5...   2025.12  0.34           0.08
Gemma 3           Judge Model=Gemma 3         2025.12  0.33           0.07
DeepSeek R1       Judge Model=DeepSeek R1     2025.12  0.32           0.09
Gemini 2.5 Flash  Judge Model=Gemini 2.5...   2025.12  0.3            0.1
LLAMA 3.2         Judge Model=LL-3.2          2025.12  0.16           0.1
Phi 4             Judge Model=Phi 4           2025.12  0.08           0.1