Preference Evaluation on PPE Preference (test)
[Chart: Kuiper Statistic by method. Leading result: 0.0434 (Probe), Dec 23, 2025. Available metrics: Kuiper Statistic, ECE.]
Updated 4d ago
Evaluation Results
| Method      | Details                         | Date    | Kuiper Statistic | ECE   |
|-------------|---------------------------------|---------|------------------|-------|
| Probe       | Model=LLAMA Scout (109B)        | 2025.12 | 0.0434           | 5.82  |
| Verbalized  | Model=LLAMA Scout (109B)        | 2025.12 | 0.1179           | 8.92  |
| Probe       | Model=GPT-OSS 20B               | 2025.12 | 0.1461           | 17.13 |
| Consistency | Model=GPT-OSS 20B, N=1...       | 2025.12 | 0.2045           | 19.22 |
| Majority    | Model=GPT-OSS 20B, N=1...       | 2025.12 | 0.2056           | 19.34 |
| Verbalized  | Model=GPT-OSS 20B               | 2025.12 | 0.2265           | 22.33 |
| Majority    | Model=LLAMA Scout (109...       | 2025.12 | 0.3506           | 32.46 |
| Consistency | Model=LLAMA Scout (109...       | 2025.12 | 0.3533           | 32.87 |