Knowledge-intensive reasoning

Benchmarks

Dataset Name	SOTA Method	Metric	Value	Count	Last Updated
HLE	R1-Searcher	Avg Score	85	75	2mo ago
Musique	R1-Searcher	Accuracy	87	51	1mo ago
MuSiQue	Llama3.1-8B + ARPO	F1 Score	34.8	43	1mo ago
HotpotQA	Llama3.1-8B + ARPO	F1 Score	0.654	41	1mo ago
2WikiMultiHopQA	APPO	Accuracy	81.5	38	1mo ago
SuperGPQA		Overall Score	72.7	35	23d ago
Knowledge-Intensive Reasoning Suite (2Wiki., Bamb., HQA, MuSi., SimQA)		2Wiki Score	58.4	25	3mo ago
Bamboogle	Llama3.1-8B + ARPO	F1	73.8	23	1mo ago
WebWalker	APPO	WebWalker Accuracy	33.5	20	1mo ago
2wikiMultiHopQA	Qwen2.5-7B + GRPO	F1 Score	76.1	18	4mo ago
WebWalker	Llama3.1-8B + ARPO	F1 Score	30.5	18	4mo ago
HQA	AutoTraj	Average Score	87	18	4mo ago
Bamboogle	EAPO	F1 Score	60.4	15	1mo ago
2WikiMultihopQA	EAPO	F1 Score	58.6	15	1mo ago
GPQA	CPPO	Result Score	38.89	14	3mo ago
GPQA ambiguity-augmented	DisambiguSLM	Accuracy	42.8	11	2mo ago
2Wiki	AutoTraj	Average Score	0.89	9	4mo ago
C-Eval		Score	90.2	7	2mo ago
MMLU-CF first 1,000 samples (test)	MGRS	Exact Match Accuracy	74.2	7	4mo ago
Knowledge-intensive reasoning suite (HotpotQA, 2WikiMultihopQA, Musique)	TEPOdense	HotpotQA Score	43.6	6	4mo ago
2Wiki	EAPO	F1 Score	52	5	1mo ago
Generalization Verification	KDCM + Code Module	Hits@1	99.18	5	4mo ago
