
RAGTruth

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Hallucination Detection | RAGTruth (test) | AUROC 0.9096 | 99 |
| Hallucination Detection | RAGTruth | AUROC 0.8535 | 58 |
| Hallucination detection | RAGTruth CNN/DM (subsample) | AUROC 0.69 | 45 |
| Hallucination detection | RAGTruth MS MARCO (subsample) | AUROC 0.77 | 45 |
| Hallucination Detection | RAGTruth RT-Summ 1.0 (test) | F1 Score 0.6966 | 30 |
| Hallucination Detection | RAGTruth RT-D2T 1.0 (test) | F1 Score 0.7383 | 30 |
| Hallucination Detection | RAGTruth RT-QA 1.0 (test) | F1 Score 0.7885 | 30 |
| Hallucination Detection | RAGTruth Llama2-13B (test) | Acc 83.33 | 21 |
| Hallucination Detection | RAGTruth Llama2-7B (test) | Accuracy 75.76 | 21 |
| Hallucination Detection | RAGTruth LLaMA3-8B | Recall 78.6 | 19 |
| Hallucination Detection | RAGTruth LLaMA2-13B | Recall 80.68 | 19 |
| Hallucination Detection | RAGTruth LLaMA2-7B | Recall 0.8328 | 19 |
| Summarization | RAGTruth summarization (test) | ROUGE-1 52 | 18 |
| Question Answering | RAGTruth | F1 Score 45.89 | 17 |
| Hallucination Detection | RAGTruth summarization task | Precision 77 | 14 |
| Response-level Hallucination Detection | RAGTruth QA | AUROC 91.89 | 13 |
| Span-level Hallucination Detection | RagTruth-Avg (test) | F1 Score 76.63 | 12 |
| Grounded Text Generation | RAGTruth | F1 Score 33.14 | 11 |
| Groundedness | RagTruth | Kendall's Tau 0.57 | 11 |
| Faithfulness detection | RAGTruth | Accuracy 90.3 | 10 |
| Hallucination Detection | RAGTruth | Summary Consistency Rate 92.67 | 10 |
| Hallucination Detection | RAGTruth Llama-13B | Recall 89.47 | 10 |
| Hallucination Detection | RAGTruth Llama-7B | Recall 92.54 | 10 |
| Token-Level Hallucination Detection | RAGTruth QA | AUROC 95.6 | 7 |
| Hallucination Mitigation | RAGTruth | Hallucination Rate 32.7 | 6 |
Showing 25 of 33 rows