Multi-evidence Aggregation on Multi-evidence Aggregation Dataset (test)
[Chart: ECE. Baseline (Uniform Avg): 0.036, Mar 13, 2026. Selectable metrics: ECE, Test Accuracy, Train-Test Gap, Epistemic Error, Robustness Score, MC Error.]
Updated 1mo ago
Evaluation Results
| Method | Method | Links | ECE | Test Accuracy | Train-Test Gap | Epistemic Error | Robustness Score | MC Error |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline (Uniform Avg) | K (Number of evidence... | 2026.03 | 0.036 | 85 | - | - | 50 | - |
| LPF-Learned | K (Number of evidence... | 2026.03 | 0.058 | 95.4 | 0.0085 | 0.002 | 12 | 0.013 |
| LPF-SPN | K (Number of evidence... | 2026.03 | 0.186 | 92 | - | 0.002 | 12 | 0.013 |