Long-context evaluation

Benchmarks

Dataset Name	SOTA Method	Metric and Value	Count	Last Updated
LongBench	LOOKAHEADKV	Average Score 31.96	96	23d ago
RULER 16k		Total Score 95.02	62	1mo ago
RULER	RecaLLM-Qwen2.5-7B	Average Accuracy Score 92.8	59	1mo ago
RULER 32k	RTPurbo	Overall Score 90.06	49	2mo ago
Ruler (test)		S-NIAH-1 100	43	4mo ago
RULER 64k	Llama-3.1-8B	VT Score 100	43	2mo ago
RULER 8k	QUOKA	Score 91.07	35	4mo ago
RULER 4k	QUOKA	Score 93.73	35	22d ago
RULER 128k	Llama-3.1-8B	Query Metric (MQ) 98	29	4mo ago
LongBench (test)	xKV	NarQA Score 32.85	18	2mo ago
Ruler	Ministral-3-8B	Average Rank 2	16	2mo ago
LB v2 (ALL)		Accuracy (ALL) 38	13	4mo ago
L-Eval	InternLM2-Chat-20B-SFT	Close Score 68.8	13	4mo ago
RULER 128k Zero-shot transfer	Still	Accuracy (RULER 128k Zero-shot) 31.3	12	1mo ago
RULER 64k Zero-shot transfer	Still	Accuracy 32.9	12	1mo ago
RULER 32k Zero-shot transfer	Still	Accuracy 37.2	12	1mo ago
RULER 128k (matched-train holdout)	Still	Accuracy 36.1	12	1mo ago
RULER 64k (matched-train holdout)	Still	Accuracy 39.9	12	1mo ago
RULER 32k (matched-train holdout)	Still	Accuracy 48.7	12	1mo ago
RULER 32K context length (test)		Niah1 Score 100	12	4mo ago
LongBench (aggregate)		EM 58.7	10	17d ago
LongBench v2	Qwen3-235B-A22B-Thinking	Overall Score 59.76	9	2mo ago
RULER 4k context length (test)	GDN	MK25.36	7	16d ago
128K context		Quality Score (Q) 80.12	6	2mo ago
MultiNews, Qasper, RepoBench-P, and RULER Averaged 128K (test)	TTKV	Memory Footprint (GB) 15.3	6	3mo ago

Showing 25 of 38 rows