| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Long-context language modeling | RULER | RULER Score | 0.911 | 148 |
| Long-context understanding | RULER | Performance @ 4K Context | 157 | 65 |
| Long-context evaluation | RULER 16k | Total Score | 95.02 | 59 |
| Long-context understanding | RULER | Score | 94.45 | 45 |
| Needle-in-a-Haystack Retrieval | RULER | S-NIAH-1 (Pass-Key Retrieval) | 100 | 42 |
| Long-context evaluation | RULER 32k | Overall Score | 89.3 | 41 |
| Long-context evaluation | RULER 8k | Score | 91.07 | 35 |
| Long-context evaluation | RULER 4k | Score | 93.73 | 35 |
| Long-context language modeling | RULER | Accuracy (8K Context) | 90.97 | 34 |
| Long-context language understanding | RULER 32k context length | Average Score | 87.5 | 30 |
| Long-context evaluation | RULER 128k | Query Metric (MQ) | 98 | 29 |
| Long-context evaluation | RULER 64k | VT Score | 100 | 29 |
| Needle In A Haystack | Ruler NIAH (Single 2) | Accuracy | 1 | 25 |
| Length Extrapolation | RULER | Performance @ 8K Context | 92.88 | 18 |
| Long-Context Retrieval | RULER | Retrieval Accuracy (8K) | 96.2 | 17 |
| Long-context understanding | RULER 32K | Accuracy | 92.33 | 16 |
| Long-context language modeling | RULER 1.0 (test) | Accuracy (4K Context) | 0.977 | 16 |
| Memory | RULER HotpotQA | Score (7K) | 79.69 | 15 |
| Long-context language modeling and retrieval | RULER | VT Score | 96.4 | 14 |
| Needle In A Haystack | Ruler NIAH Single 3 | Accuracy | 84 | 13 |
| Long-context retrieval | RULER 64K context | Accuracy | 84.3 | 13 |
| Long-context understanding | RULER (dev) | Accuracy (4K Context) | 96.1 | 13 |
| Long-context capability evaluation | RULER 32768 length | Accuracy | 91.87 | 12 |
| Long-context capability evaluation | RULER 16384 length | Accuracy | 92.02 | 12 |
| Long-context capability evaluation | RULER 8192 length | Accuracy | 93.75 | 12 |