| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Long-context language modeling | RULER | RULER Score | 0.9142 | 204 |
| Needle-in-a-Haystack Retrieval | RULER | S-NIAH-1 (Pass-Key Retrieval) | 100 | 94 |
| Long-context Understanding | RULER 16k (test) | RULER Score | 93.5 | 90 |
| Long-context Understanding | RULER 4k (test) | RULER 4k Score | 95.7 | 90 |
| Long-context retrieval and aggregation | RULER 32k | Average Accuracy | 89.56 | 76 |
| Long-context retrieval and aggregation | RULER 16k | Average Accuracy | 93.07 | 76 |
| Long-context retrieval and aggregation | RULER 8k | Average Accuracy | 94.24 | 76 |
| Long-context retrieval and aggregation | RULER 4k | Average Accuracy | 94.73 | 76 |
| Long-context language modeling | RULER | Accuracy (8K Context) | 96.29 | 75 |
| Long-context language modeling | RULER 16K context | Accuracy (RULER 16K) | 83 | 72 |
| Long-context language modeling evaluation | RULER Context Length = 8K | Average Accuracy (RULER 8K) | 89.59 | 72 |
| Long-context understanding | RULER | Score | 96 | 66 |
| Long-context understanding | RULER | Performance @ 4K Context | 157 | 65 |
| Long-context evaluation | RULER 16k | Total Score | 95.02 | 59 |
| Long-context evaluation | RULER | Average Accuracy Score | 92.8 | 54 |
| Long-context language modeling evaluation | RULER | Score (4K) | 97.36 | 49 |
| Long-context evaluation | RULER 32k | Overall Score | 90.06 | 49 |
| Long-context retrieval and synthetic reasoning | RULER | Accuracy | 83.01 | 47 |
| Long-Context Retrieval | RULER | Retrieval Accuracy (8K) | 98.14 | 44 |
| Long-context Evaluation | Ruler (test) | S-NIAH-1 | 100 | 43 |
| Long-context evaluation | RULER 64k | VT Score | 100 | 43 |
| Long-context language understanding | RULER 32k context length | FWE | 0 | 39 |
| Long-context understanding | RULER 32K | Accuracy | 94.48 | 38 |
| Text Question Answering | RULER | Accuracy | 70.7 | 37 |
| Long-context language modeling | Ruler llama3-8B-Instruct (test) | S-NIAH-1 | 100 | 37 |