RULER

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Long-context language modeling | RULER | RULER Score: 0.9142 | 204 |
| Needle-in-a-Haystack Retrieval | RULER | S-NIAH-1 (Pass-Key Retrieval): 100 | 94 |
| Long-context Understanding | RULER 16k (test) | RULER Score: 93.5 | 90 |
| Long-context Understanding | RULER 4k (test) | RULER 4k Score: 95.7 | 90 |
| Long-context retrieval and aggregation | RULER 32k | Average Accuracy: 89.56 | 76 |
| Long-context retrieval and aggregation | RULER 16k | Average Accuracy: 93.07 | 76 |
| Long-context retrieval and aggregation | RULER 8k | Average Accuracy: 94.24 | 76 |
| Long-context retrieval and aggregation | RULER 4k | Average Accuracy: 94.73 | 76 |
| Long-context language modeling | RULER | Accuracy (8K Context): 96.29 | 75 |
| Long-context language modeling | RULER 16K context | Accuracy (RULER 16K): 83 | 72 |
| Long-context language modeling evaluation | RULER Context Length = 8K | Average Accuracy (RULER 8K): 89.59 | 72 |
| Long-context understanding | RULER | Score: 96 | 66 |
| Long-context understanding | RULER | Performance @ 4K Context: 157 | 65 |
| Long-context evaluation | RULER 16k | Total Score: 95.02 | 59 |
| Long-context evaluation | RULER | Average Accuracy Score: 92.8 | 54 |
| Long-context language modeling evaluation | RULER | Score (4K): 97.36 | 49 |
| Long-context evaluation | RULER 32k | Overall Score: 90.06 | 49 |
| Long-context retrieval and synthetic reasoning | RULER | Accuracy: 83.01 | 47 |
| Long-Context Retrieval | RULER | Retrieval Accuracy (8K): 98.14 | 44 |
| Long-context Evaluation | Ruler (test) | S-NIAH-1: 100 | 43 |
| Long-context evaluation | RULER 64k | VT Score: 100 | 43 |
| Long-context language understanding | RULER 32k context length | FWE: 0 | 39 |
| Long-context understanding | RULER 32K | Accuracy: 94.48 | 38 |
| Text Question Answering | RULER | Accuracy: 70.7 | 37 |
| Long-context language modeling | Ruler llama3-8B-Instruct (test) | S-NIAH-1: 100 | 37 |

Showing 25 of 220 rows
...