| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Question Answering | NarrativeQA | F1 Score: 30.41 | 87 |
| Question Answering | NarrativeQA (test) | ROUGE-L: 76.2 | 61 |
| Question Answering | NarrativeQA | Score: 21.96 | 40 |
| Long-context Question Answering | NarrativeQA | F1 Score: 53.56 | 38 |
| Question Answering | NarrativeQA | F1: 33.61 | 36 |
| Single-hop Question Answering | NarrativeQA | Score: 23.89 | 22 |
| Question Answering | NarrativeQA | EM: 34.79 | 18 |
| Question Answering | NarrativeQA No Trun. latest (test) | Average Score: 23.96 | 18 |
| Question Answering | NarrativeQA | F1 Score: 28.94 | 16 |
| Long narrative understanding QA | NarrativeQA | Accuracy: 55 | 14 |
| Multi-session Retrieval-Augmented Generation | NarrativeQA (test) | F1 Score: 38.4 | 12 |
| Document Retrieval | NarrativeQA (test) | nDCG@10: 61.7 | 12 |
| Long-context Question Answering | NarrativeQA | Exact Match: 61.7 | 11 |
| Question Answering | NarrativeQA Helmet benchmark | F1 Score: 49.5 | 9 |
| Question Answering | NarrativeQA Trun. latest (test) | Average Score: 21.34 | 9 |
| Retrieval | NarrativeQA | Recall@3: 29.11 | 8 |
| Reading Comprehension | NarrativeQA (test) | BLEU-1: 54.11 | 8 |
| Reading Comprehension | NarrativeQA summaries | BLEU-1: 36.55 | 8 |
| Question Answering | NarrativeQA | Prefill Throughput (tok/s): 24,686.84 | 6 |
| Latency Evaluation | NarrativeQA | End-to-End Latency: 2.1 | 6 |
| Question Answering | NarrativeQA summaries (test) | BLEU-1: 43.63 | 6 |
| Reading Comprehension | NarrativeQA Story Summaries (val) | BLEU-1: 52.78 | 6 |
| Question Answering | NarrativeQA | ROUGE-L: 0.32 | 5 |
| Question Answering | NarrativeQA (dev) | ROUGE-L: 31.6 | 4 |
| Multi-mention reading comprehension | NarrativeQA (test) | ROUGE-L: 58.8 | 4 |