| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Truthfulness Evaluation | TruthfulQA | Accuracy 70.8 | 93 |
| Hallucination Detection | TruthfulQA (test) | AUC-ROC 89.5 | 91 |
| Multiple-Choice | TruthfulQA | MC1 Accuracy 58.5 | 83 |
| Question Answering | TruthfulQA | Accuracy 86.6 | 82 |
| Question Answering | TruthfulQA | Accuracy 86.64 | 73 |
| Machine-Generated Text Detection | TruthfulQA | TPR@FPR-1% 94.85 | 48 |
| Hallucination Detection | TruthfulQA | AUC (ROC) 0.9417 | 47 |
| Hallucination | TruthfulQA | Score 75.76 | 42 |
| Question Answering | TruthfulQA | Truthful*Inf Score 88.23 | 42 |
| Factuality Evaluation | TruthfulQA | MC1 49.75 | 40 |
| Open ended generation | TruthfulQA Without Rejected Samples open-ended (full) | Truthfulness 74.67 | 39 |
| Open ended generation | TruthfulQA With All Samples open-ended (full) | Truthfulness 82.75 | 39 |
| Multiple-Choice Question Answering | TruthfulQA MC1 | MC1 Accuracy 76.2 | 33 |
| Truthfulness Evaluation | TruthfulQA | Reliability Score 16.9 | 33 |
| Truthfulness Evaluation | TruthfulQA (test) | MC1 54.95 | 30 |
| Question Answering | TruthfulQA MC1 | MC1 Accuracy 88.8 | 24 |
| Truthfulness and Informativeness | TruthfulQA | TruthfulQA Score 78.46 | 24 |
| Short-Answer Factuality | TruthfulQA (test) | MC1 Factuality Score 47.47 | 24 |
| Truthfulness Evaluation | TruthfulQA medical (test) | Health Score 83.6 | 22 |
| Question Answering | TRUTHFULQA | Factual Accuracy 47 | 21 |
| Question Answering | TruthfulQA o=1 Domain-level split | Accuracy 88.5 | 21 |
| Question Answering | TruthfulQA o=1 Semantic-level | Accuracy 90.9 | 21 |
| Question Answering | TruthfulQA o=1 (Exact split) | Accuracy 90 | 21 |
| Question Answering | TruthfulQA Domain-level split, o=3 | Accuracy 92.8 | 21 |
| Question Answering | TruthfulQA Semantic-level split o=3 | Accuracy 98.1 | 21 |