| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| LongBench | DPO w/ LongReward | Overall Average Score | 62.1 | 115 | 4d ago |
| LongBench (test) | | Avg Score | 54 | 80 | 4d ago |
| RULER | PyramidInfer | Performance @ 4K Context | 157 | 65 | 4d ago |
| LongBench | QUOKA | Accuracy | 103 | 60 | 2d ago |
| RULER | | Score | 94.45 | 45 | 4d ago |
| LongBench V2 | | Overall Score | 65.6 | 37 | 4d ago |
| LongBench 1.0 (test) | | NarrativeQA | 26.63 | 32 | 4d ago |
| InfiniteBench v1 (test) | SnapKV | Dialogue | 20 | 31 | 4d ago |
| LongBench V1 | | NQA | 31 | 30 | 4d ago |
| LongBench (test) | VIST2-8B | SingleDoc Performance | 45.2 | 30 | 2d ago |
| LongBench | | 2WikiMQA | 55.13 | 25 | 4d ago |
| LongBench v1 (test) | Llama-3.1-8B | SD QA | 49.6 | 21 | 4d ago |
| LongBench | TidalDecode | MFQA | 30.94 | 18 | 4d ago |
| LongBench | BLASST | Overall Average Score | 31.8 | 17 | 4d ago |
| RULER 32K | | Accuracy | 92.33 | 16 | 4d ago |
| HELMET 2025 | MInference | Accuracy (8K Context) | 61.44 | 16 | 4d ago |
| LongEval | | Score | 79 | 16 | 4d ago |
| LongBench v0.2 | SpindleKV | Na.QA (NER QA) | 26.95 | 16 | 4d ago |
| LongBench | PRD | Average Context Length (tokens) | 815,449.95 | 16 | 4d ago |
| Long-Context Understanding | GPT-5 | Score | 66.8 | 14 | 4d ago |
| RULER (dev) | Olmo 3 32B | Accuracy (4K Context) | 96.1 | 13 | 4d ago |
| LongBench English | Quest | MultiNews Score | 26.49 | 12 | 4d ago |
| InfiniteBench | FlexPrefill | En. MC Accuracy | 0.6812 | 12 | 4d ago |
| LongBench E | TCA-Attention | Single Doc QA | 52.28 | 10 | 4d ago |
| RULER | Sketch&Walk | Accuracy (4K) | 96.56 | 8 | 4d ago |