| Dataset Name | SOTA Method | Metric | Score | Trend | Updated |
|---|---|---|---|---|---|
| RealHitBench | DeepSeek-R1 | Exact Match | 70.91 | 49 | 4d ago |
| PubHealth | KG-CRAFT L3.3 | Balanced Accuracy | 78.66 | 26 | 4d ago |
| COVID-Fact | OpenAI o1 | Balanced Acc | 75.9 | 22 | 4d ago |
| LIAR-RAW | KG-CRAFT | Precision | 77.38 | 20 | 4d ago |
| FEVEROUS (test) | Trification | Macro F1 | 74.72 | 20 | 4d ago |
| InFi-Check-FG 1.0 (test) | Llama-3.1-8B-Instruct | PredE | 18.82 | 18 | 4d ago |
| FeLMWk | PCC | F1 (True) | 0.79 | 16 | 4d ago |
| HOVER 4-hop (test) | Trification | Macro F1 | 66.23 | 16 | 4d ago |
| HOVER 3-hop (test) | Trification | Macro F1 | 66.42 | 16 | 4d ago |
| HOVER 2-hop (test) | Trification | Macro F1 | 75.13 | 16 | 4d ago |
| Average across General and Medical Domains | | Overall Average | 73.6 | 15 | 4d ago |
| SCIFact | OpenAI o1 | Balanced Acc | 90.3 | 15 | 4d ago |
| ExpertQA | GraphCheck | Balanced Accuracy | 60.3 | 15 | 4d ago |
| SummEval | | Balanced Accuracy | 77.3 | 15 | 4d ago |
| AggreFact CNN | GraphEval | Balanced Acc | 69.5 | 15 | 4d ago |
| AggreFact Xsum | GPT-4o | Balanced Accuracy | 76.4 | 15 | 4d ago |
| FEVEROUS | | F1 Macro | 89.4 | 14 | 4d ago |
| FEVER | | F1 Macro | 94.3 | 14 | 4d ago |
| FEVEROUS-S | RRC | Macro F1 | 72.55 | 12 | 4d ago |
| HOVER | FOLK | Macro F1 (2-hop) | 71.82 | 12 | 4d ago |
| LIAR | | Accuracy | 79 | 12 | 4d ago |
| LIAR (test) | UPA | Accuracy | 68.2 | 11 | 4d ago |
| LLM-AGGREFACT (test) | AlignScore | Cost ($) | 0.2 | 10 | 4d ago |
| FEVER v1.0 (dev) | LongLLMLingua | Acc | 55.1 | 10 | 4d ago |
| AVeriTeC (test) | HerO | Hu-METEOR (Q only) | 0.48 | 9 | 4d ago |