Long-context retrieval and reasoning on RULER 11 tasks average
[Chart: Context Length 4K Performance. Highlighted point: Full, 99.34 (May 9, 2026).]
Updated 21d ago
Evaluation Results
Performance by context length (4K to 128K) and RULER 11 Tasks Average Performance:

| Method | Method details | Date | 4K | 8K | 16K | 32K | 64K | 128K | RULER 11 Tasks Average |
|---|---|---|---|---|---|---|---|---|---|
| Full | Model=Llama3.1-8B-Inst... | 2026.05 | 99.34 | 98.83 | 98.55 | 94.89 | 89.85 | 79.32 | 93.46 |
| ReST-KV | Model=Llama3.1-8B-Inst... | 2026.05 | 94.01 | 86.66 | 84.12 | 81.87 | 78.65 | 68.28 | 82.27 |
| SnapKV | Model=Llama3.1-8B-Inst... | 2026.05 | 83.6 | 75.54 | 71.12 | 66.95 | 57.47 | 47.99 | 67.11 |
| PyramidKV | Model=Llama3.1-8B-Inst... | 2026.05 | 81.35 | 73.66 | 70.23 | 69.83 | 57.84 | 48.93 | 66.97 |
| Streaming | Model=Llama3.1-8B-Inst... | 2026.05 | 39.81 | 18.42 | 12.1 | 10.57 | 9.91 | 8.18 | 16.5 |
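The reported average in the results listed above appears consistent with an unweighted mean of the six per-length scores (4K through 128K). That reading is inferred from the numbers only; the page does not state how the average is defined. A minimal sketch of the check, with the hypothetical helper names `mean_over_lengths` and `check_averages`:

```python
# Scores copied from the results above, in the order 4K, 8K, 16K, 32K, 64K, 128K,
# followed by the reported "RULER 11 Tasks Average" value.
SCORES = {
    "Full":      ([99.34, 98.83, 98.55, 94.89, 89.85, 79.32], 93.46),
    "ReST-KV":   ([94.01, 86.66, 84.12, 81.87, 78.65, 68.28], 82.27),
    "SnapKV":    ([83.60, 75.54, 71.12, 66.95, 57.47, 47.99], 67.11),
    "PyramidKV": ([81.35, 73.66, 70.23, 69.83, 57.84, 48.93], 66.97),
    "Streaming": ([39.81, 18.42, 12.10, 10.57, 9.91, 8.18], 16.50),
}


def mean_over_lengths(per_length):
    """Unweighted mean of the per-context-length scores (an assumed definition)."""
    return sum(per_length) / len(per_length)


def check_averages(scores, tol=0.01):
    """Return {method: (computed_mean, reported, within_tolerance)}.

    The tolerance absorbs the two-decimal rounding of the reported values.
    """
    result = {}
    for method, (per_length, reported) in scores.items():
        computed = mean_over_lengths(per_length)
        result[method] = (computed, reported, abs(computed - reported) <= tol)
    return result


if __name__ == "__main__":
    for method, (computed, reported, ok) in check_averages(SCORES).items():
        print(f"{method:10s} computed={computed:.3f} reported={reported:.2f} match={ok}")
```

Comparing within a small tolerance rather than with `round(...) ==` avoids false mismatches when a mean lands on a rounding boundary, as it does for ReST-KV (82.265).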