Confidence Estimation (Freeform Tagging) on WildHallu
[Chart: Brier Score (BS) by date (May 29, 2025); top entry LOVEC-DPO at 4.1. Other metrics shown: ECE-M, Scaled Calibration (SC).]
Updated 19d ago
Evaluation Results
| Method | Setting | Date | Brier Score (BS) | ECE-M | Scaled Calibration (SC) |
|---|---|---|---|---|---|
| LOVEC-DPO | Base Model=Gemma-2-9B-... | 2025.05 | 4.1 | 1.3 | 51.8 |
| LOVEC-GRPO | Tagging Format=Freefor... | 2025.05 | 6 | 8.2 | 63.1 |
| LOVEC-DPO | Tagging Format=Freefor... | 2025.05 | 6.3 | 5.4 | 62.1 |
| LOVEC-GRPO | Base Model=Gemma-2-9B-... | 2025.05 | 7.3 | 5.6 | 52.2 |
| LOVEC-SFT | Base Model=Gemma-2-9B-... | 2025.05 | 8 | 12.2 | 36.1 |
| LOVEC-SFT | Tagging Format=Freefor... | 2025.05 | 8.9 | 15.1 | 58.8 |
| luq | Base Model=Gemma-2-9B-... | 2025.05 | 11.9 | 16.3 | 50 |
| Self-Cons | Base Model=Gemma-2-9B-... | 2025.05 | 13.4 | 17.7 | 43.2 |
| Verb-Conf | Base Model=Gemma-2-9B-... | 2025.05 | 18.5 | 19.2 | 35.1 |
| p(true) | Base Model=Gemma-2-9B-... | 2025.05 | 19.3 | 22.8 | 25.4 |
| Vanilla | Base Model=Gemma-2-9B-... | 2025.05 | 22.5 | 26.3 | 28.9 |
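
The results above report a Brier Score and an ECE variant without defining them on this page. As a rough orientation only, the sketch below computes the textbook Brier score and an equal-width binned expected calibration error from per-claim confidences and 0/1 correctness labels. The function names, the 10-bin default, and the toy inputs are assumptions made for illustration; the benchmark's ECE-M and Scaled Calibration (SC) may be defined differently, and the reported numbers may be scaled (for example, multiplied by 100).

```python
from typing import Sequence


def _check(confidences: Sequence[float], labels: Sequence[int]) -> None:
    # Shared validation: equal-length, non-empty, confidences in [0, 1].
    if len(confidences) != len(labels) or len(labels) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")


def brier_score(confidences: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared gap between confidence and the 0/1 correctness label."""
    _check(confidences, labels)
    return sum((c - y) ** 2 for c, y in zip(confidences, labels)) / len(labels)


def expected_calibration_error(
    confidences: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width binned ECE (textbook form, not necessarily ECE-M)."""
    _check(confidences, labels)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, labels):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))

    total = len(labels)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        # Weight each bin's calibration gap by its share of the samples.
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Toy data (hypothetical): four claims with confidences and correctness.
    conf = [0.9, 0.8, 0.3, 0.6]
    correct = [1, 1, 0, 0]
    print("Brier:", brier_score(conf, correct))
    print("ECE:  ", expected_calibration_error(conf, correct))
```

Lower is better for both quantities in this sketch. A method that always states 0.5 on a half-correct set is perfectly calibrated by the ECE above yet still scores 0.25 on the Brier score, since Brier also rewards separating correct from incorrect claims.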