| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Long-context language modeling | RULER | RULER Score: 0.911 | 148 | |
| Long-context language modeling | RULER 16K context | Accuracy (RULER 16K): 83 | 72 | |
| Long-context language modeling evaluation | RULER Context Length = 8K | Average Accuracy (RULER 8K): 89.59 | 72 | |
| Long-context understanding | RULER | Performance @ 4K Context: 157 | 65 | |
| Long-context evaluation | RULER 16k | Total Score: 95.02 | 59 | |
| Long-context language modeling | RULER | Accuracy: 89.1 | 51 | |
| Long-context understanding | RULER | Score: 96 | 50 | |
| Long-context retrieval and synthetic reasoning | RULER | Accuracy: 83.01 | 47 | |
| Long-context Evaluation | Ruler (test) | S-NIAH-1: 100 | 43 | |
| Needle-in-a-Haystack Retrieval | RULER | S-NIAH-1 (Pass-Key Retrieval): 100 | 42 | |
| Long-context evaluation | RULER 32k | Overall Score: 89.3 | 41 | |
| Long-context language modeling | Ruler llama3-8B-Instruct (test) | S-NIAH-1: 100 | 37 | |
| Long-context evaluation | RULER 8k | Score: 91.07 | 35 | |
| Long-context evaluation | RULER 4k | Score: 93.73 | 35 | |
| Long-context evaluation | RULER | Accuracy (Context 4k): 98.8 | 34 | |
| Long-Context Retrieval | RULER | Retrieval Accuracy (8K): 96.2 | 34 | |
| Variable Tracking | RULER-VT | Accuracy: 99.9 | 33 | |
| Long-context language understanding | RULER 32k context length | VT Score: 98.2 | 33 | |
| Long-context evaluation | RULER 128k | Query Metric (MQ): 98 | 29 | |
| Long-context evaluation | RULER 64k | VT Score: 100 | 29 | |
| Long-context understanding | RULER 32K | Accuracy: 94.48 | 26 | |
| Needle In A Haystack | Ruler NIAH (Single 2) | Accuracy: 1 | 25 | |
| Long-context understanding | RULER 64K | Accuracy: 92.37 | 25 | |
| Long-context Understanding | RULER | Performance (8K Context): 92.88 | 24 | |
| Long-context understanding | RULER | S1 Score: 100 | 20 | |