| Dataset Name | SOTA Method | Metric | Trend | Updated |
|---|---|---|---|---|
| Bolmo Evaluation Suite GenQA 7B | Llama 3.1 70B | GenQA Average 81.6 | 39 | 1mo ago |
| MsMARCO (test) | Match-LSTM | ROUGE Score 40.7 | 18 | 3mo ago |
| MsMARCO (dev) | RAG | ROUGE Score 57.2 | 11 | 3mo ago |
| SimpleQA | RCSP | HALL Score 92 | 10 | 1d ago |
| MedHallu | | HALL Score 67.33 | 10 | 1d ago |
| WebQuestions | | HALL Score 52 | 10 | 1d ago |
| TruthfulQA (test) | Plan-and-Solve | HALL Score 82.33 | 10 | 1d ago |
| HaluEval (test) | Plan-and-Solve | HALL Rate 50.33 | 10 | 1d ago |
| Lu Xun's essay collections | CharacterBot | Content Score 3.758 | 10 | 3mo ago |
| Amazon (test) | Prior-Aug | EM 57.99 | 8 | 3mo ago |
| Reddit (test) | | EM 61.19 | 8 | 3mo ago |
| BioASQ (test) | SWEP | EM 43.01 | 8 | 3mo ago |
| NYT (test) | SWEP | EM 76.42 | 8 | 3mo ago |
| Wiki (test) | SWEP | EM 73.34 | 8 | 3mo ago |
| FatwaQA | Gemini-3-Pro | Accuracy 67 | 7 | 2mo ago |
| DriveLM (test) | DriveLM-Agent | BLEU-4 53.09 | 5 | 3mo ago |
| TruthfulQA | KLAS | ROUGE-1 64.5 | 4 | 5d ago |
| SQuAD | Blended RAG | EM 57.63 | 3 | 3mo ago |