WildHallu

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Long-form generation factuality and uncertainty estimation | WildHallu (test) | Factuality Score: 0.86 | 14 |
| Confidence Estimation (Freeform Tagging) | WildHallu | Brier Score (BS): 4.1 | 11 |
| Confidence Estimation (Iterative Tagging) | WildHallu | Brier Score (BS): 5.7 | 9 |