HaluEval

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Hallucination Detection | HaluEval (test) | AUC-ROC 97.1 | 126 |
| Hallucination Detection | HaluEval | F1 Score 83.6 | 75 |
| Hallucination Detection | HaluEvalQA | ROC-AUC 89 | 28 |
| Hallucination Detection | HaluEval Dialogue latest (test) | Accuracy 84.88 | 22 |
| Hallucination Detection | HaluEval QA | Accuracy 99.5 | 17 |
| Hallucination Detection | HaluEval | Dialogue Score 72.2 | 15 |
| Factuality Evaluation | HaluEval | Accuracy (Response) 68.7 | 14 |
| Question Answering | HaluEval QA | Accuracy 45.4 | 14 |
| Hallucination Detection (Dialogue) | HaluEval DA | F1 Score 77.1 | 12 |
| Question Answering | HaluEval | EM 68 | 12 |
| Hallucination Detection | HaluEval Sum | F1 Score 65.9 | 12 |
| Grounded Text Generation | HaluEval | F1 Score 72.66 | 11 |
| Groundedness | HaluEval | Kendall's Tau 0.78 | 11 |
| Hallucination Detection | HaluEval QA (test) | TPR 78.9 | 8 |
| Hallucination Detection | HaluEval Summarization (Starling-LM-7B-alpha) | TPR 81 | 7 |
| Question Answering | HaluEval | Accuracy 31 | 6 |
| Hallucination Evaluation | HaluEval | Average Score 23.5 | 6 |
| Hallucination Detection | HaluEval Summarization | Accuracy 50 | 6 |
| Instruction Following | HaluEval QAmis (test) | Failure Rate 0.0078 | 6 |
| Instruction Following | HaluEval (test) | Failure Rate (Sum) 0.36 | 6 |
| Question Answering | HaluEval qa_samples | F1 Score 86.7 | 5 |
| Hallucination Regeneration | HaluEval QA | Accuracy 69.45 | 5 |
| Question Answering | HaluEval | nAUPC 11.3 | 4 |
| Factual Reasoning | HaluEval General | Baseline Wins 30 | 2 |
| LLM Hallucination Detection | HaluEval (random sample of 1,000 text pairs) | Recall 95.3 | 1 |

Showing 25 of 28 rows