| Dataset Name | SOTA Method | Metric | Trend | | |
|---|---|---|---|---|---|
| TruthfulQA | HICD | MC1 49.75 | 40 | 4d ago | |
| Held-out Dataset | F-DPO | Factuality Score 8.84 | 21 | 4d ago | |
| QAGS XSUM | ALIGN | Pearson Correlation 53.9 | 19 | 4d ago | |
| FACTOR | SLED | FACTOR Score 75.55 | 18 | 4d ago | |
| TruthfulQA latest (test) | SkillAggregation-X | Accuracy 84.57 | 16 | 4d ago | |
| AggreFact-XSum FTS | AlignScore | Balanced Accuracy 80.2 | 15 | 4d ago | |
| AggreFact-CNN (OLD) | FENICEGPT_claims | Balanced Accuracy 82.1 | 15 | 4d ago | |
| AggreFact CNN (EXF) | SummaC-ZS | Balanced Accuracy 76.5 | 15 | 4d ago | |
| AggreFact-CNN (FTS) | SummaC-Cv | Balanced Accuracy 70.3 | 15 | 4d ago | |
| AggreFact-XSum (OLD) | MENLI | Balanced Accuracy 73.9 | 14 | 4d ago | |
| AggreFact-XSum (EXF) | AlignScore | Balanced Accuracy 0.799 | 14 | 4d ago | |
| AggreFact (FTSOTA) | FENICE_GPT_claims | Balanced Accuracy (CNN-FTS) 70.5 | 14 | 4d ago | |
| LLM-AggreFact (test) | MiniCheck-FT5 | CNN Score 69.9 | 13 | 4d ago | |
| Rank19 | | Accuracy 83.9 | 13 | 4d ago | |
| Biography | Yi-1.5-9B | Correctness Count 29.4 | 12 | 4d ago | |
| SQUAD v2 | Yi-1.5-9B | Correct Count 28.9 | 12 | 4d ago | |
| HotpotQA | RLFH | Average Score 0.686 | 12 | 4d ago | |
| TruthfulQA | | Factuality Score (0-shot) 64.3 | 12 | 4d ago | |
| QAGS CNN | BARTSCORE | Pearson Correlation 0.735 | 11 | 4d ago | |
| FactScore (unlabeled) | PaCE | US (%) 76.4 | 10 | 4d ago | |
| FactScore (labeled) | PaCE | LS Score (%) 64.8 | 10 | 4d ago | |
| XSUM FRANK | ENTFA | Partial Pearson's ρ 0.183 | 9 | 4d ago | |
| BIO (test) | CaLF | FS Score 88.9 | 8 | 4d ago | |
| XSUM | ENTFA | PCC 0.268 | 7 | 4d ago | |
| Long-form summarization factuality dataset (test) | FENICE | Balanced Accuracy 66.2 | 5 | 4d ago | |