Confidence Estimation (Iterative Tagging) on WildHallu
[Chart: Brier Score (BS). Highlighted result: LOVEC-GRPO, 5.7 (May 29, 2025). Other metrics shown: ECE-M, Scaled Calibration (SC).]
Updated 19d ago
Evaluation Results
| Method | Details | Date | Brier Score (BS) | ECE-M | Scaled Calibration (SC) |
|---|---|---|---|---|---|
| LOVEC-GRPO | Tagging Format=Iterati... | 2025.05 | 5.7 | 2.5 | 57 |
| LOVEC-DPO | Tagging Format=Iterati... | 2025.05 | 6 | 5 | 60.4 |
| LOVEC-SFT | Tagging Format=Iterati... | 2025.05 | 9.1 | 15.2 | 51.1 |
| Vanilla | Backbone=Llama3-8B-Ins... | 2025.05 | 10.8 | 6 | 9.1 |
| LUQ | Backbone=Llama3-8B-Ins... | 2025.05 | 14.5 | 21.5 | 56.8 |
| p(true)-ft | Backbone=Llama3-8B-Ins... | 2025.05 | 16.4 | 19.5 | 47.5 |
| Self-Cons | Backbone=Llama3-8B-Ins... | 2025.05 | 16.5 | 24.3 | 47.8 |
| Verb-Conf | Backbone=Llama3-8B-Ins... | 2025.05 | 20.3 | 22.1 | 13.4 |
| p(true) | Backbone=Llama3-8B-Ins... | 2025.05 | 23.8 | 23.6 | 15.8 |