| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Hallucination Detection | HaluEval (test) | AUC-ROC 97.1 | 126 |
| Hallucination Detection | HaluEval Dialogue latest (test) | Accuracy 84.88 | 22 |
| Hallucination Detection | HaluEval | Dialogue Score 72.2 | 15 |
| Question Answering | HaluEval QA | Accuracy 45.4 | 14 |
| Question Answering | HaluEval | EM 68 | 12 |
| Grounded Text Generation | HaluEval | F1 Score 72.66 | 11 |
| Groundedness | HaluEval | Kendall's Tau 0.78 | 11 |
| Hallucination Detection | HaluEval QA (test) | TPR 78.9 | 8 |
| Hallucination Detection | HaluEval Summarization (Starling-LM-7B-alpha) | TPR 81 | 7 |
| Hallucination Detection | HaluEval Sum | Accuracy (H) 37.46 | 7 |
| Hallucination Detection | HaluEval Summarization | Accuracy 50 | 6 |
| Instruction Following | HaluEval QAmis (test) | Failure Rate 0.0078 | 6 |
| Instruction Following | HaluEval (test) | Failure Rate (Sum) 0.36 | 6 |
| Hallucination Detection | HaluEval | AUROC 0.8021 | 6 |
| Question Answering | HaluEval qa_samples | F1 Score 86.7 | 5 |
| Hallucination Regeneration | HaluEval QA | Accuracy 69.45 | 5 |
| Hallucination Detection | HaluEval Dialogue (test) | Groundedness (Gamma) 0.287 | 1 |
| Hallucination Detection | HaluEval ChatGPT (test) | Coverage 94.5 | 1 |