Long-context Reasoning

Benchmarks

Dataset Name	SOTA Method	Metric	Value
LongBench v2	Qwen-3.5	Average Score	68.9	113	22d ago
LongBench	GHG-TDA	Score	73.8	107	24d ago
LongBench	CAP-CoT	Accuracy (LongBench)	70.4	101	2mo ago
LoCoMo	FluxMem	F1 (Multi Hop)	93.26	78	1mo ago
BABILong 16k		Accuracy	29.8	72	3mo ago
BABILong 8k		Accuracy	34.7	65	3mo ago
BABILong 4k		Accuracy (BABILong 4k)	38.5	51	3mo ago
InfiniteBench		Overall Score	36.51	45	24d ago
OOLONG	λ-RLM	Accuracy	68.4	37	3mo ago
LongReason Average across 16k-64k		Accuracy	54.2	36	29d ago
LongReason 64k prefill length		Accuracy	53.02	36	29d ago
LongReason 32k prefill length		Accuracy	53.9	36	29d ago
LongReason 16k prefill length 1.0 (test)		Accuracy	55.67	36	29d ago
Long-context Benchmarks 100K context LB-V2 DocMath Frames LB-MQA (test)	Qwen3-30B-A3B-Thinking + SPELL	DocMath Score	66.7	36	4mo ago
Long-context Benchmarks 16K context DocMath Frames LB-MQA V2 (test)	Qwen3-30B-A3B-Thinking + SPELL	DocMath	64.1	36	4mo ago
RULER	HyLo-Llama-14MLA14M2	RULER Performance (8K Context)	75.3	35	2mo ago
AA-LCR	Kimi-K2.6	Score	70.2	35	1mo ago
LongReason 64K-input 70K context	KVZip	Accuracy	71.25	34	1mo ago
∞ Bench	MiA (Emb-Only)	Accuracy	90.39	32	4mo ago
OOLONG trec_coarse	Kimi K2	Score	86.6	28	4mo ago
FRAMES		Score	84.7	27	23d ago
OOL-Pairs		Latency (s)	5.1	27	4mo ago
OOLONG		Latency (s)	7.1	27	4mo ago
LongGenBench 8K		GSM8K Score	44.51	22	1mo ago
LongGenBench 4K		GSM8K Score	53.18	22	1mo ago

Showing 25 of 70 rows