| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Long-context Language Understanding | LongBench | M-Avg: 60.31 | 219 |
| Long-context language understanding | LongBench (test) | Average Score: 51.87 | 133 |
| Long-context understanding | LongBench | Overall Average Score: 62.1 | 115 |
| Long-context understanding | LongBench (test) | Avg Score: 54 | 80 |
| Long-context Reasoning | LongBench | Score: 73.8 | 62 |
| Long-context Understanding | LongBench | Accuracy: 103 | 60 |
| Long-context Question Answering | LongBench (test) | HotpotQA: 7,011 | 59 |
| Long document retrieval | LongBench Retrieval v2 (full) | F1 Score: 0.4843 | 55 |
| Long-context Reasoning | LongBench v2 | Average Score: 68.2 | 48 |
| Long-context Language Modeling | LongBench | Single-Document QA: 42.77 | 44 |
| Long-context language modeling | LongBench-E 1.0 (test) | S-Doc QA Perf.: 49.92 | 37 |
| Long Context Understanding | LongBench V2 | Overall Score: 65.6 | 37 |
| Single-Doc Question Answering | LongBench | MultifieldQA Score: 49.34 | 36 |
| Long-context understanding | LongBench 1.0 (test) | NarrativeQA: 26.63 | 32 |
| Long-context understanding | LongBench V1 | NQA: 31 | 30 |
| Long-context understanding | LongBench (test) | SingleDoc Performance: 45.2 | 30 |
| Long-context understanding | LongBench | 2WikiMQA: 55.13 | 25 |
| Long-context Question Answering | LongBench V2 | SingleDoc Accuracy: 51.43 | 22 |
| Long-context understanding | LongBench v1 (test) | SD QA: 49.6 | 21 |
| Long-context language understanding | LongBench 1.0 (test) | MultiNews: 61.5 | 21 |
| Long-context language understanding | LongBench v2 | Overall Accuracy: 46.32 | 20 |
| Long-context understanding | LongBench | MFQA: 30.94 | 18 |
| Long-context understanding | LongBench | Overall Average Score: 31.8 | 17 |
| Long-context Language Understanding | LongBench-e (test) | LCC (Language Comprehension Score): 68.42 | 16 |
| Long-context generation | LongBench | Average Score: 48.5 | 16 |