WildHallu

Benchmarks

Task Name	Dataset Name	SOTA Result
Long-form generation factuality and uncertainty estimation	WildHallu (test)	Factuality Score: 0.86	14
Confidence Estimation (Freeform Tagging)	WildHallu	Brier Score (BS): 4.1	11
Confidence Estimation (Iterative Tagging)	WildHallu	Brier Score (BS): 5.7	9
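
For readers unfamiliar with the Brier Score used in the two confidence-estimation rows, the following is a minimal sketch of the standard definition: the mean squared difference between a predicted confidence and a binary outcome, where lower is better. The function name, the sample confidences, and the labels are hypothetical and are not taken from WildHallu. The standard score lies in [0, 1]; the listed values of 4.1 and 5.7 fall outside that range, so they may be reported on a rescaled basis (for example, multiplied by 100), but the listing does not say so and that is only an assumption.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Return the mean squared difference between confidence and binary outcome."""
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")
    squared_errors = ((p - y) ** 2 for p, y in zip(confidences, outcomes))
    return sum(squared_errors) / len(confidences)


# Hypothetical per-claim confidences and labels (1 = factual, 0 = not factual).
example_confidences = [0.9, 0.8, 0.3, 0.6]
example_labels = [1, 1, 0, 1]

score = brier_score(example_confidences, example_labels)
print(f"Brier score: {score:.3f}")
```

A perfectly confident and correct set of predictions scores 0, while a perfectly confident and wrong set scores 1.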
