| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|---|
| Hallucination Detection | HaluEval (test) | AUC-ROC 98.55 | 176 |
| Hallucination Detection | HaluEval | AUROC 1 | 131 |
| Hallucination Evaluation | HaluEval | Accuracy (ACC) 100 | 51 |
| Hallucination Detection | HaluEvalQA | ROC-AUC 89 | 39 |
| Factuality Evaluation | HaluEval Sum (500 items) | MC1 Score 63 | 30 |
| Factuality Evaluation | HaluEval QA (500 items) | MC1 Score 86.4 | 30 |
| Causal Faithfulness Evaluation | HaluEval Adversarial | nAOPC 100 | 28 |
| Hallucination Detection | HaluEval Dialogue latest (test) | Accuracy 84.88 | 22 |
| Hallucination Detection | HaluEval QA | Accuracy 99.5 | 17 |
| Hallucination Detection | HaluEval Gemini outputs (test) | AUROC 0.571 | 15 |
| Hallucination Detection | HaluEval GPT outputs (test) | AUROC 0.582 | 15 |
| Hallucination Detection | HaluEval Llama outputs (test) | AUROC 0.704 | 15 |
| Hallucination Detection | HaluEval | Dialogue Score 72.2 | 15 |
| Factuality Evaluation | HaluEval | Accuracy (Response) 68.7 | 14 |
| Question Answering | HaluEval QA | Accuracy 45.4 | 14 |
| Hallucination Detection | HaluEval held-out 50% (test) | AUROC 69.9 | 12 |
| Hallucination Detection (Dialogue) | HaluEval DA | F1 Score 77.1 | 12 |
| Question Answering | HaluEval | EM 68 | 12 |
| Hallucination Detection | HaluEval Sum | F1 Score 65.9 | 12 |
| Grounded Text Generation | HaluEval | F1 Score 72.66 | 11 |
| Groundedness | HaluEval | Kendall's Tau 0.78 | 11 |
| Generative Question Answering | HaluEval (test) | HALL Rate 50.33 | 10 |
| Hallucination Detection | HaluEval | HaluEval Delta 21.3 | 10 |
| Hallucination Detection | HaluEval (in-distribution) | AUC 92.86 | 9 |
| Hallucination Detection | HaluEval QA (test) | TPR 78.9 | 8 |