| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| LongBench | CortexDebate | M-Avg | 60.31 | 292 | 19d ago |
| LongBench (test) | | Average Score | 51.87 | 147 | 3d ago |
| LongBench | | Average Score | 58.4 | 86 | 9d ago |
| InfiniteBench | Full | En.Sum | 33.01 | 81 | 1mo ago |
| LongBench 1.0 (test) | Original | MultiNews | 61.5 | 61 | 9d ago |
| LongBench v2 | HyLRA | Overall Accuracy | 46.32 | 47 | 16d ago |
| RULER 32k context length | PQcache | VT Score | 98.2 | 33 | 3d ago |
| L-Eval | NTK | Coursera | 58.28 | 26 | 1mo ago |
| L-Eval (test) | | Coursera | 58.28 | 26 | 1mo ago |
| LongBench v1 (test) | | NrtvQA Score | 30.7 | 22 | 4d ago |
| SCROLLS (test) | COLT5-XL | Average Score | 47.4 | 18 | 1mo ago |
| RULER 16k context length | RetroInfer | Average Score | 94.73 | 16 | 3d ago |
| SCBench | Llama-3.1-8B | KV Retrieval | 79 | 16 | 1mo ago |
| LongBench-e (test) | HATA | LCC (Language Comprehension Score) | 68.42 | 16 | 1mo ago |
| LongBench Ministral-8B-Instruct | StructKV | NrtvQA | 30.21 | 14 | 9d ago |
| LongBench 1 host v1 (test) | kvtc16x | 2WQA Score | 46.23 | 14 | 1mo ago |
| Long tasks 4 tasks (val) | EuroBERT/EuroBERT-610m | Long Tasks Score | 83.24 | 13 | 1mo ago |
| LongBench-e | Exact | LCC | 69.96 | 12 | 24d ago |
| RULER 128k | RetroInfer | Accuracy | 89.49 | 10 | 19d ago |
| RULER 4k context length | LLama2-7B-chat | VT Score | 27 | 10 | 3d ago |
| RULER 64k context length | CLAA | QA Score | 63.8 | 9 | 3d ago |
| RULER 0 shot v1 (test) | | CWE Score | 94.71 | 7 | 1mo ago |
| LongBench Llama-3.2-1B-Instruct (test) | | NQA | 16.12 | 7 | 1mo ago |
| SCROLLS (dev) | BARTlarge-SLED | GovRep ROUGE-1 | 57.4 | 7 | 1mo ago |
| LongBench 20 samples/task | Mamba-1.4B | NarrQA Performance | 1.91 | 4 | 4d ago |