RULER

Benchmarks

Task Name | Dataset Name | Metric | SOTA Result | Trend
--- | --- | --- | --- | ---
Long-context language modeling | RULER | RULER Score | 0.911 | 148
Long-context language modeling | RULER 16K context | Accuracy (RULER 16K) | 83 | 72
Long-context language modeling evaluation | RULER Context Length = 8K | Average Accuracy (RULER 8K) | 89.59 | 72
Long-context understanding | RULER | Performance @ 4K Context | 157 | 65
Long-context evaluation | RULER 16k | Total Score | 95.02 | 59
Long-context language modeling | RULER | Accuracy | 89.1 | 51
Long-context understanding | RULER | Score | 96 | 50
Long-context retrieval and synthetic reasoning | RULER | Accuracy | 83.01 | 47
Long-context Evaluation | Ruler (test) | S-NIAH-1 | 100 | 43
Needle-in-a-Haystack Retrieval | RULER | S-NIAH-1 (Pass-Key Retrieval) | 100 | 42
Long-context evaluation | RULER 32k | Overall Score | 89.3 | 41
Long-context language modeling | Ruler llama3-8B-Instruct (test) | S-NIAH-1 | 100 | 37
Long-context evaluation | RULER 8k | Score | 91.07 | 35
Long-context evaluation | RULER 4k | Score | 93.73 | 35
Long-context evaluation | RULER | Accuracy (Context 4k) | 98.8 | 34
Long-Context Retrieval | RULER | Retrieval Accuracy (8K) | 96.2 | 34
Variable Tracking | RULER-VT | Accuracy | 99.9 | 33
Long-context language understanding | RULER 32k context length | VT Score | 98.2 | 33
Long-context evaluation | RULER 128k | Query Metric (MQ) | 98 | 29
Long-context evaluation | RULER 64k | VT Score | 100 | 29
Long-context understanding | RULER 32K | Accuracy | 94.48 | 26
Needle In A Haystack | Ruler NIAH (Single 2) | Accuracy | 1 | 25
Long-context understanding | RULER 64K | Accuracy | 92.37 | 25
Long-context Understanding | RULER | Performance (8K Context) | 92.88 | 24
Long-context understanding | RULER | S1 Score | 100 | 20
Showing 25 of 132 rows