Correctness Evaluation on PPE Correctness (test)
[Chart: Kuiper and ECE. Highlighted value: 0.0643 (Kuiper, Probe). Date shown: Dec 23, 2025.]
Updated 4d ago
Evaluation Results

Method       Configuration              Date     Kuiper  ECE
Probe        Model=LLAMA Scout (109B)   2025.12  0.0643  0.0684
Probe        Model=GPT-OSS 20B          2025.12  0.0689  0.1193
Consistency  Model=GPT-OSS 20B, N=1...  2025.12  0.0932  0.098
Majority     Model=GPT-OSS 20B, N=1...  2025.12  0.099   0.1034
Verbalized   Model=GPT-OSS 20B          2025.12  0.1462  0.152
Verbalized   Model=LLAMA Scout (109B)   2025.12  0.1931  0.1597
Majority     Model=LLAMA Scout (109...  2025.12  0.3197  0.294
Consistency  Model=LLAMA Scout (109...  2025.12  0.3212  0.2978
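
The page does not say how the Kuiper and ECE scores are computed. As a rough illustration only, the sketch below uses common textbook definitions: ECE with equal-width confidence bins, and the Kuiper calibration statistic as the range of the cumulative gap between correctness and confidence. The function names, the bin count of 10, and the toy inputs are assumptions made for this example, not the benchmark's actual settings. Lower is better for both.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    Assumes equal-width bins on [0, 1]; the benchmark's binning is unknown.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


def kuiper_calibration(
    confidences: Sequence[float], correct: Sequence[int]
) -> float:
    """Kuiper statistic: range (max - min) of the cumulative sum of
    (correct - confidence) / n, taken in order of increasing confidence.

    Bin-free by construction; the running sum starts at 0.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    order = sorted(range(n), key=lambda i: confidences[i])
    running = highest = lowest = 0.0
    for i in order:
        running += (correct[i] - confidences[i]) / n
        highest = max(highest, running)
        lowest = min(lowest, running)
    return highest - lowest


# Toy data (hypothetical): two confident hits, two low-confidence misses.
toy_conf = [0.9, 0.9, 0.1, 0.1]
toy_correct = [1, 1, 0, 0]
toy_ece = expected_calibration_error(toy_conf, toy_correct)
toy_kuiper = kuiper_calibration(toy_conf, toy_correct)
```

Unlike ECE, the Kuiper statistic has no bin-count parameter, which is one reason the two columns can rank methods differently.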