Long-context language modeling

Benchmarks

Dataset Name	SOTA Method	Metric	Value	Count	Age
LongBench		Average Score	58.4	369	25d ago
RULER	DASH-3	RULER Score	0.9142	204	1mo ago
RULER	Qwen2.5-14B-Instruct-1M	Accuracy (8K Context)	96.29	80	24d ago
RULER 16K context	SINK + TAIL + TOP-K + ϕ	Accuracy (RULER 16K)	83	72	3mo ago
LongBench (test)		Qasper Score	50.87	54	25d ago
Ruler llama3-8B-Instruct (test)		S-NIAH-1	100	37	4mo ago
LongBench-E 1.0 (test)	Elastic Attention	S-Doc QA Perf.	49.92	37	4mo ago
RULER 4K		Accuracy	95.3	29	1mo ago
HELMET		Summarization Score	247	27	4mo ago
LongBench	RaBitQCache	Generation Score	50.6	24	24d ago
ZeroSCROLLS (test)	GDWM	GovReport Score	35.8	24	4mo ago
InfiniteBench	Fast-dLLM v2	Code Debug Accuracy	46.19	22	1mo ago
LongBench 4-task average	2d hetero	Average Accuracy	12.7	17	3mo ago
RULER 1.0 (test)	MInference	Accuracy (4K Context)	0.977	16	4mo ago
InfiniteBench (test)	Llama 3.1 8B Instruct	En QA Score	34.82	14	3mo ago
RULER	MoE Transformer	S1 Score	100	13	23d ago
RULER (test)	ProxyAttn	Sparsity	80	13	3mo ago
LongBench v2	FoLoRA	Accuracy	29.62	12	1mo ago
LongBench	AdaKV w/ CriticalKV	LongBench Average Score	46.23	12	1mo ago
LongBench-En Filtered Tasks	CPC	Single Document Score	42.6	9	1mo ago
RULER Sequence length = 64k	RDKV	S-NIAH Score (Component 1)	100	8	1mo ago
LongBench 1.0 (test)		NrtvQA	32.6	8	3mo ago
LongBench MultiFieldQA, MuSiQue, GovReport 2023 (test)	DroPE	MultiFieldQA Score	32.18	8	4mo ago
RULER (test)	Baseline	Accuracy (4k Context)	96.6	7	3mo ago
LongBench V2 (test)		Acc (Short)	60	7	3mo ago

Showing 25 of 38 rows