| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Long-context Language Understanding | LongBench | M-Avg | 60.31 | 292 |
| Long-context Language Modeling | LongBench | Average Score | 58.4 | 164 |
| Long-context language understanding | LongBench (test) | Average Score | 51.87 | 147 |
| Long-context understanding | LongBench (test) | Avg Score | 58.7 | 136 |
| Long-context understanding | LongBench | Overall Average Score | 62.1 | 115 |
| Long Context Understanding | LongBench V2 | Overall Score | 65.6 | 109 |
| Long-context Language Understanding | LongBench | Average Score | 58.4 | 86 |
| Long-context understanding | LongBench | HotpotQA | 57.15 | 82 |
| Single-Doc Question Answering | LongBench | MultifieldQA Score | 53.67 | 75 |
| Long-context Question Answering | LongBench (test) | HotpotQA | 7,011 | 69 |
| Long-context Reasoning | LongBench | Score | 73.8 | 62 |
| Long-context language understanding | LongBench 1.0 (test) | MultiNews | 61.5 | 61 |
| Long-context Understanding | LongBench | Accuracy | 103 | 60 |
| Long-context Evaluation | LongBench | Average Score | 31.96 | 57 |
| Long document retrieval | LongBench Retrieval v2 (full) | F1 Score | 0.4843 | 55 |
| Few-shot Learning | LongBench | TREC Score | 82.62 | 51 |
| Summarization | LongBench | GovRep Score | 33.39 | 51 |
| Long-context Reasoning | LongBench v2 | Average Score | 68.2 | 48 |
| Long-context language understanding | LongBench v2 | Overall Accuracy | 46.32 | 47 |
| Long-context Reasoning | LongBench | Accuracy (LongBench) | 68.7 | 45 |
| Multi-Document Question Answering | LongBench | HotpotQA Acc | 57.67 | 45 |
| Code Analysis | LongBench | Lcc Score | 70.64 | 43 |
| Synthetic Tasks | LongBench | PCount | 14.67 | 43 |
| Multi-choice Question Answering | LongBench v2 | Overall Accuracy | 46.5 | 41 |
| Long-context language modeling | LongBench-E 1.0 (test) | S-Doc QA Perf. | 49.92 | 37 |