RAGTruth

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Hallucination Detection | RAGTruth (test) | AUROC 0.9096 | 83 |
| Hallucination detection | RAGTruth CNN/DM (subsample) | AUROC 0.69 | 45 |
| Hallucination detection | RAGTruth MS MARCO (subsample) | AUROC 0.77 | 45 |
| Hallucination Detection | RAGTruth | AUROC 0.8535 | 36 |
| Hallucination Detection | RAGTruth RT-Summ 1.0 (test) | F1 Score 0.6966 | 30 |
| Hallucination Detection | RAGTruth RT-D2T 1.0 (test) | F1 Score 0.7383 | 30 |
| Hallucination Detection | RAGTruth RT-QA 1.0 (test) | F1 Score 0.7885 | 30 |
| Hallucination Detection | RAGTruth Llama2-13B (test) | Acc 83.33 | 21 |
| Hallucination Detection | RAGTruth Llama2-7B (test) | Accuracy 75.76 | 21 |
| Hallucination Detection | RAGTruth LLaMA3-8B | Recall 78.6 | 19 |
| Hallucination Detection | RAGTruth LLaMA2-13B | Recall 80.68 | 19 |
| Hallucination Detection | RAGTruth LLaMA2-7B | Recall 0.8328 | 19 |
| Summarization | RAGTruth summarization (test) | ROUGE-1 52 | 18 |
| Question Answering | RAGTruth | F1 Score 45.89 | 17 |
| Hallucination Detection | RAGTruth summarization task | Precision 77 | 14 |
| Span-level Hallucination Detection | RagTruth-Avg (test) | F1 Score 76.63 | 12 |
| Grounded Text Generation | RAGTruth | F1 Score 33.14 | 11 |
| Groundedness | RagTruth | Kendall's Tau 0.57 | 11 |
| Hallucination Detection | RAGTruth | Summary Consistency Rate 92.67 | 10 |
| Hallucination Detection | RAGTruth Llama-13B | Recall 89.47 | 10 |
| Hallucination Detection | RAGTruth Llama-7B | Recall 92.54 | 10 |
| Hallucination Mitigation | RAGTruth | Hallucination Rate 32.7 | 6 |
| Answer-level hallucination detection | RAGTruth Enhance | Precision 100 | 5 |
| Answer-level hallucination detection | RAGTruth ++ | Precision 100 | 5 |
| Hallucination detection | RAGTruth Summarization Mistral-7b | AUCROC 74.45 | 4 |
Showing 25 of 29 rows