| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Hallucination Detection | HaluEval (test) | AUC-ROC | 97.1 | 126 |
| Hallucination Detection | HaluEval | F1 Score | 83.6 | 75 |
| Hallucination Detection | HaluEvalQA | ROC-AUC | 89 | 28 |
| Hallucination Detection | HaluEval Dialogue latest (test) | Accuracy | 84.88 | 22 |
| Hallucination Detection | HaluEval QA | Accuracy | 99.5 | 17 |
| Hallucination Detection | HaluEval | Dialogue Score | 72.2 | 15 |
| Factuality Evaluation | HaluEval | Accuracy (Response) | 68.7 | 14 |
| Question Answering | HaluEval QA | Accuracy | 45.4 | 14 |
| Hallucination Detection (Dialogue) | HaluEval DA | F1 Score | 77.1 | 12 |
| Question Answering | HaluEval | EM | 68 | 12 |
| Hallucination Detection | HaluEval Sum | F1 Score | 65.9 | 12 |
| Grounded Text Generation | HaluEval | F1 Score | 72.66 | 11 |
| Groundedness | HaluEval | Kendall's Tau | 0.78 | 11 |
| Hallucination Detection | HaluEval QA (test) | TPR | 78.9 | 8 |
| Hallucination Detection | HaluEval Summarization (Starling-LM-7B-alpha) | TPR | 81 | 7 |
| Question Answering | HaluEval | Accuracy | 31 | 6 |
| Hallucination Evaluation | HaluEval | Average Score | 23.5 | 6 |
| Hallucination Detection | HaluEval Summarization | Accuracy | 50 | 6 |
| Instruction Following | HaluEval QAmis (test) | Failure Rate | 0.0078 | 6 |
| Instruction Following | HaluEval (test) | Failure Rate (Sum) | 0.36 | 6 |
| Question Answering | HaluEval qa_samples | F1 Score | 86.7 | 5 |
| Hallucination Regeneration | HaluEval QA | Accuracy | 69.45 | 5 |
| Question Answering | HaluEval | nAUPC | 11.3 | 4 |
| Factual Reasoning | HaluEval General | Baseline Wins | 30 | 2 |
| LLM Hallucination Detection | HaluEval (random sample of 1,000 text pairs) | Recall | 95.3 | 1 |