| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Question Answering | NarrativeQA | F1 Score | 38.12 | 124 |
| Question Answering | NarrativeQA (test) | ROUGE-L | 76.2 | 88 |
| Question Answering | NarrativeQA | Score | 21.96 | 40 |
| Question Answering | NarrativeQA | EM | 15 | 38 |
| Long-context Question Answering | NarrativeQA | F1 Score | 53.56 | 38 |
| Long-context Question Answering | NarrativeQA | SubEM | 22 | 36 |
| Question Answering | NarrativeQA | F1 | 33.61 | 36 |
| Question Answering | NarrativeQA | BLEU-1 | 21.5 | 28 |
| Question Answering | NarrativeQA LongBench | F1 Score | 14.76 | 24 |
| Single-hop Question Answering | NarrativeQA | Score | 23.89 | 22 |
| Long-context Question Answering | NarrativeQA Passage Split | Score | 32.64 | 18 |
| Long-context Question Answering | NarrativeQA Fixed Chunk 2048 | Score | 32.64 | 18 |
| Question Answering | NarrativeQA | EM | 34.79 | 18 |
| Question Answering | NarrativeQA No Trun. latest (test) | Average Score | 23.96 | 18 |
| Question Answering | NarrativeQA | F1 Score | 28.94 | 16 |
| Question Answering | NarrativeQA | Rouge-L | 45 | 15 |
| Long narrative understanding QA | NarrativeQA | Accuracy | 55 | 14 |
| Traceback (Prompt Injection Attacks) | NarrativeQA | Precision | 98 | 13 |
| Question Answering | NarrativeQA | TTFT (ms) | 355.12 | 12 |
| Question Answering | NarrativeQA | Peak GPU Memory (GB) | 0.58 | 12 |
| Question Answering | NarrativeQA LongBench 32K context | F1 Score | 17.2 | 12 |
| Multi-session Retrieval-Augmented Generation | NarrativeQA (test) | F1 Score | 38.4 | 12 |
| Document Retrieval | NarrativeQA (test) | nDCG@10 | 61.7 | 12 |
| Prompt Injection Attack | NarrativeQA | ASR | 86 | 11 |
| Long-context Question Answering | NarrativeQA | Exact Match | 61.7 | 11 |