| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Hallucination Detection | TruthfulQA | AUC (ROC) 0.9417 | 178 |
| Question Answering | TruthfulQA | Accuracy 86.6 | 164 |
| Hallucination Detection | TruthfulQA (test) | AUC-ROC 89.5 | 112 |
| Truthfulness Evaluation | TruthfulQA | Accuracy 75 | 108 |
| Factuality Evaluation | TruthfulQA | MC2 94.3 | 103 |
| Factuality | TruthfulQA | Accuracy 83.41 | 97 |
| Hallucination Detection | TruthfulQA | AUROC 0.8851 | 91 |
| Truthfulness | TruthfulQA | Truthfulness Accuracy 97.55 | 86 |
| Multiple-Choice | TruthfulQA | MC1 Accuracy 58.5 | 83 |
| Question Answering | TruthfulQA | Accuracy 86.64 | 73 |
| Selective Generation | TruthfulQA | ROC-AUC 0.744 | 66 |
| Question Answering | TruthfulQA | TruthfulQA Score 63 | 61 |
| Truthfulness Evaluation | TruthfulQA | T·I Score 84.7 | 59 |
| Question Answering | TruthfulQA MC1 | MC1 Accuracy 88.8 | 54 |
| Machine-Generated Text Detection | TruthfulQA | TPR@FPR-1% (ChatGLM) 98.38 | 54 |
| Question Answering | TruthfulQA | Performance Score 81.1 | 52 |
| Truthfulness | TruthfulQA | Truthfulness Accuracy 72.36 | 51 |
| Truthful Question Answering | TruthfulQA MC2 | MC2 Accuracy 56.46 | 51 |
| Open-ended Generation | TruthfulQA | BLEURT Score 70.13 | 48 |
| Predicting answer correctness | TruthfulQA | AUROC 0.7272 | 48 |
| Truthful and Informative Generation | TruthfulQA (test) | True*Info (%) 84.7 | 44 |
| Question Answering | TruthfulQA | MC2 68.25 | 43 |
| Generation correctness prediction | TruthfulQA (test) | AURC 62.69 | 42 |
| Hallucination | TruthfulQA | Score 75.76 | 42 |
| Question Answering | TruthfulQA | Truthful*Inf Score 88.23 | 42 |