HaluEval

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Hallucination Detection | HaluEval (test) | AUC-ROC 98.55 | 176 |
| Hallucination Detection | HaluEval | AUROC 1 | 131 |
| Hallucination Evaluation | HaluEval | Accuracy (ACC) 100 | 51 |
| Hallucination Detection | HaluEvalQA | ROC-AUC 89 | 39 |
| Factuality Evaluation | HaluEval Sum (500 items) | MC1 Score 63 | 30 |
| Factuality Evaluation | HaluEval QA (500 items) | MC1 Score 86.4 | 30 |
| Causal Faithfulness Evaluation | HaluEval Adversarial | nAOPC 100 | 28 |
| Hallucination Detection | HaluEval Dialogue latest (test) | Accuracy 84.88 | 22 |
| Hallucination Detection | HaluEval QA | Accuracy 99.5 | 17 |
| Hallucination Detection | HaluEval Gemini outputs (test) | AUROC 0.571 | 15 |
| Hallucination Detection | HaluEval GPT outputs (test) | AUROC 0.582 | 15 |
| Hallucination Detection | HaluEval Llama outputs (test) | AUROC 0.704 | 15 |
| Hallucination Detection | HaluEval | Dialogue Score 72.2 | 15 |
| Factuality Evaluation | HaluEval | Accuracy (Response) 68.7 | 14 |
| Question Answering | HaluEval QA | Accuracy 45.4 | 14 |
| Hallucination Detection | HaluEval held-out 50% (test) | AUROC 69.9 | 12 |
| Hallucination Detection (Dialogue) | HaluEval DA | F1 Score 77.1 | 12 |
| Question Answering | HaluEval | EM 68 | 12 |
| Hallucination Detection | HaluEval Sum | F1 Score 65.9 | 12 |
| Grounded Text Generation | HaluEval | F1 Score 72.66 | 11 |
| Groundedness | HaluEval | Kendall's Tau 0.78 | 11 |
| Generative Question Answering | HaluEval (test) | HALL Rate 50.33 | 10 |
| Hallucination detection | HaluEval | HaluEval Delta 21.3 | 10 |
| Hallucination Detection | HaluEval (in-distribution) | AUC 92.86 | 9 |
| Hallucination Detection | HaluEval QA (test) | TPR 78.9 | 8 |
Showing 25 of 43 rows