| Dataset Name | SOTA Method | Metric | Trend | | |
|---|---|---|---|---|---|
| Bolmo Evaluation Suite GenQA 7B | Llama 3.1 70B | GenQA Average 81.6 | 39 | 12d ago | |
| MsMARCO (test) | Match-LSTM | ROUGE Score 40.7 | 18 | 1mo ago | |
| MsMARCO (dev) | RAG | ROUGE Score 57.2 | 11 | 1mo ago | |
| Lu Xun's essay collections | CharacterBot | Content Score 3.758 | 10 | 1mo ago | |
| Amazon (test) | Prior-Aug | EM 57.99 | 8 | 1mo ago | |
| Reddit (test) | | EM 61.19 | 8 | 1mo ago | |
| BioASQ (test) | SWEP | EM 43.01 | 8 | 1mo ago | |
| NYT (test) | SWEP | EM 76.42 | 8 | 1mo ago | |
| Wiki (test) | SWEP | EM 73.34 | 8 | 1mo ago | |
| FatwaQA | Gemini-3-Pro | Accuracy 67 | 7 | 1mo ago | |
| DriveLM (test) | DriveLM-Agent | BLEU-4 53.09 | 5 | 1mo ago | |
| SQuAD | Blended RAG | EM 57.63 | 3 | 1mo ago | |