RULER

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Long-context language modeling | RULER | RULER Score: 0.911 | 148 |
| Long-context understanding | RULER | Performance @ 4K Context: 157 | 65 |
| Long-context evaluation | RULER 16k | Total Score: 95.02 | 59 |
| Long-context understanding | RULER | Score: 94.45 | 45 |
| Needle-in-a-Haystack Retrieval | RULER | S-NIAH-1 (Pass-Key Retrieval): 100 | 42 |
| Long-context evaluation | RULER 32k | Overall Score: 89.3 | 41 |
| Long-context evaluation | RULER 8k | Score: 91.07 | 35 |
| Long-context evaluation | RULER 4k | Score: 93.73 | 35 |
| Long-context language modeling | RULER | Accuracy (8K Context): 90.97 | 34 |
| Long-context language understanding | RULER 32k context length | Average Score: 87.5 | 30 |
| Long-context evaluation | RULER 128k | Query Metric (MQ): 98 | 29 |
| Long-context evaluation | RULER 64k | VT Score: 100 | 29 |
| Needle In A Haystack | Ruler NIAH (Single 2) | Accuracy: 1 | 25 |
| Length Extrapolation | RULER | Performance @ 8K Context: 92.88 | 18 |
| Long-Context Retrieval | RULER | Retrieval Accuracy (8K): 96.2 | 17 |
| Long-context understanding | RULER 32K | Accuracy: 92.33 | 16 |
| Long-context language modeling | RULER 1.0 (test) | Accuracy (4K Context): 0.977 | 16 |
| Memory | RULER HotpotQA | Score (7K): 79.69 | 15 |
| Long-context language modeling and retrieval | RULER | VT Score: 96.4 | 14 |
| Needle In A Haystack | Ruler NIAH Single 3 | Accuracy: 84 | 13 |
| Long-context retrieval | RULER 64K context | Accuracy: 84.3 | 13 |
| Long-context understanding | RULER (dev) | Accuracy (4K Context): 96.1 | 13 |
| Long-context capability evaluation | RULER 32768 length | Accuracy: 91.87 | 12 |
| Long-context capability evaluation | RULER 16384 length | Accuracy: 92.02 | 12 |
| Long-context capability evaluation | RULER 8192 length | Accuracy: 93.75 | 12 |
Showing 25 of 71 rows