Statistical Comparison on TruthfulQA (Mean Improvement, P-Value)
[Chart: Mean Improvement by method; best entry PASf at 0.33, dated Sep 25, 2025. Reported metrics: Mean Improvement, 95% CI Lower Bound, P-Value.]
Updated 16d ago
Evaluation Results
Method  Model                Date     Mean Improvement  95% CI Lower Bound  P-Value
PASf    DeepSeek-R1-Dist...  2025.09  0.33              0.27                0
PASf    Llama-3.1-8B-Ins...  2025.09  0.27              0.24                0
iPASwo  DeepSeek-R1-Dist...  2025.09  0.23              0.16                0
PASf    Nous-Hermes-2-Mi...  2025.09  0.21              0.17                0
iPASwo  Nous-Hermes-2-Mi...  2025.09  0.17              0.13                0
iPASwo  Llama-3.1-8B-Ins...  2025.09  0.16              0.14                0
iPASa   DeepSeek-R1-Dist...  2025.09  0.15              0.07                0
iPASa   Llama-3.1-8B-Ins...  2025.09  0.13              0.11                0
iPASa   Nous-Hermes-2-Mi...  2025.09  0.08              0.04                0