| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Hallucination Detection | RAGTruth (test) | AUROC | 0.9096 | 83 |
| Hallucination Detection | RAGTruth | AUROC | 0.8535 | 36 |
| Hallucination Detection | RAGTruth RT-Summ 1.0 (test) | F1 Score | 0.6966 | 30 |
| Hallucination Detection | RAGTruth RT-D2T 1.0 (test) | F1 Score | 0.7383 | 30 |
| Hallucination Detection | RAGTruth RT-QA 1.0 (test) | F1 Score | 0.7885 | 30 |
| Hallucination Detection | RAGTruth Llama2-13B (test) | Accuracy | 83.33 | 21 |
| Hallucination Detection | RAGTruth Llama2-7B (test) | Accuracy | 75.76 | 21 |
| Hallucination Detection | RAGTruth LLaMA3-8B | Recall | 78.6 | 19 |
| Hallucination Detection | RAGTruth LLaMA2-13B | Recall | 80.68 | 19 |
| Hallucination Detection | RAGTruth LLaMA2-7B | Recall | 0.8328 | 19 |
| Summarization | RAGTruth summarization (test) | ROUGE-1 | 52 | 18 |
| Question Answering | RAGTruth | F1 Score | 45.89 | 17 |
| Hallucination Detection | RAGTruth summarization task | Precision | 77 | 14 |
| Span-level Hallucination Detection | RagTruth-Avg (test) | F1 Score | 76.63 | 12 |
| Grounded Text Generation | RAGTruth | F1 Score | 33.14 | 11 |
| Groundedness | RagTruth | Kendall's Tau | 0.57 | 11 |
| Hallucination Detection | RAGTruth Llama-13B | Recall | 89.47 | 10 |
| Hallucination Detection | RAGTruth Llama-7B | Recall | 92.54 | 10 |
| Hallucination Detection | RAGTruth Summarization Mistral-7b | AUCROC | 74.45 | 4 |
| Hallucination Detection | RAGTruth Summarization (Llama-2-13b) | AUCROC | 72.9 | 4 |
| Hallucination Detection | RAGTruth Summarization (Llama-2-7b) | AUCROC | 73.37 | 4 |