| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Question Answering | NarrativeQA | F1 Score | 32.6 | 92 |
| Question Answering | NarrativeQA (test) | ROUGE-L | 76.2 | 68 |
| Question Answering | NarrativeQA | Score | 21.96 | 40 |
| Long-context Question Answering | NarrativeQA | F1 Score | 53.56 | 38 |
| Long-context Question Answering | NarrativeQA | SubEM | 22 | 36 |
| Question Answering | NarrativeQA | F1 | 33.61 | 36 |
| Question Answering | NarrativeQA LongBench | F1 Score | 14.76 | 24 |
| Single-hop Question Answering | NarrativeQA | Score | 23.89 | 22 |
| Long-context Question Answering | NarrativeQA Passage Split | Score | 32.64 | 18 |
| Long-context Question Answering | NarrativeQA Fixed Chunk 2048 | Score | 32.64 | 18 |
| Question Answering | NarrativeQA | EM | 34.79 | 18 |
| Question Answering | NarrativeQA No Trun. latest (test) | Average Score | 23.96 | 18 |
| Question Answering | NarrativeQA | ROUGE-L | 28.1 | 17 |
| Question Answering | NarrativeQA | F1 Score | 28.94 | 16 |
| Long narrative understanding QA | NarrativeQA | Accuracy | 55 | 14 |
| Traceback (Prompt Injection Attacks) | NarrativeQA | Precision | 98 | 13 |
| Multi-session Retrieval-Augmented Generation | NarrativeQA (test) | F1 Score | 38.4 | 12 |
| Document Retrieval | NarrativeQA (test) | nDCG@10 | 61.7 | 12 |
| Long-context Question Answering | NarrativeQA | Exact Match | 61.7 | 11 |
| Context Traceback | NarrativeQA LongBench | Precision | 96 | 10 |
| Question Answering | NarrativeQA | Ragas Answer Relevance | 58.08 | 9 |
| Question Answering | NarrativeQA Helmet benchmark | F1 Score | 49.5 | 9 |
| Question Answering | NarrativeQA Trun. latest (test) | Average Score | 21.34 | 9 |
| Retrieval | NarrativeQA | Recall@3 | 29.11 | 8 |
| Reading Comprehension | NarrativeQA (test) | BLEU-1 | 54.11 | 8 |
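Many of the question-answering rows report token-overlap scores (F1, EM / Exact Match). The sketch below shows how such scores are commonly computed. It assumes SQuAD-style answer normalization (lowercasing, removing punctuation and the articles "a", "an", "the") and takes the best score over the reference answers, since NarrativeQA questions come with more than one reference answer. Individual papers on this leaderboard may use different normalization or aggregation, so treat this as an illustration rather than the evaluation code behind any specific result. The helper names (`normalize_answer`, `token_f1`, `best_over_references`) are hypothetical.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """SQuAD-style normalization (assumed): lowercase, strip punctuation and articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over normalized tokens."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric, prediction: str, references: list[str]) -> float:
    """Score against every reference answer and keep the maximum (a common convention)."""
    return max(metric(prediction, ref) for ref in references)


if __name__ == "__main__":
    refs = ["lighthouse keeper", "his father"]
    pred = "The old lighthouse keeper"
    print(best_over_references(token_f1, pred, refs))     # about 0.8
    print(best_over_references(exact_match, pred, refs))  # 0.0
```

Leaderboards usually report these values multiplied by 100 and averaged over all questions, which is why the table shows figures such as 32.6 rather than fractions.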