| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Long-context Language Modeling | LongBench | Average Score | 58.4 | 328 |
| Long-context Language Understanding | LongBench | M-Avg | 60.31 | 294 |
| Long-context language understanding | LongBench (test) | Average Score | 51.87 | 147 |
| Long-context understanding | LongBench | F1 Score | 34 | 143 |
| Long-context understanding | LongBench (test) | Avg Score | 58.7 | 136 |
| Long Context Understanding | LongBench V2 | Overall Score | 82.36 | 133 |
| Query Routing | LongBench OOD v2 | QA | 53 | 120 |
| Long-context understanding | LongBench | Overall Average Score | 62.1 | 115 |
| Long-context Reasoning | LongBench | Accuracy (LongBench) | 70.4 | 101 |
| Long-context language understanding | LongBench-e | Average Score | 53.04 | 93 |
| Long-context Evaluation | LongBench | Average Score | 31.96 | 90 |
| Long-context Reasoning | LongBench v2 | Average Score | 68.2 | 88 |
| Long-context Language Understanding | LongBench | Average Score | 58.4 | 86 |
| Long-context understanding | LongBench 1.0 (test) | NarrativeQA | 32.94 | 84 |
| Long-context understanding | LongBench | HotpotQA | 57.15 | 82 |
| Single-Doc Question Answering | LongBench | MultifieldQA Score | 53.67 | 75 |
| Long-context understanding | LongBench (test) | FewShot Performance | 71.4 | 72 |
| Long-context Question Answering | LongBench (test) | HotpotQA | 7,011 | 69 |
| Question Answering | LongBench Qasper | F1 | 0.4459 | 62 |
| Long-context language understanding | LongBench v2 | Overall Accuracy | 46.32 | 62 |
| Long-context Reasoning | LongBench | Score | 73.8 | 62 |
| Long-context language understanding | LongBench 1.0 (test) | MultiNews | 61.5 | 61 |
| Long-context Understanding | LongBench | Accuracy | 103 | 60 |
| Long document retrieval | LongBench Retrieval v2 (full) | F1 Score | 0.4843 | 55 |
| Few-shot Learning | LongBench | TREC Score | 82.62 | 51 |