SOTA Reasoning evaluation benchmarks and papers with code

Benchmarks

Dataset Name	SOTA Method	Metric	Score	Count	Updated
DialogSum	Aligner	Reasoning	99.1	33	4mo ago
DeepfakeJudge Reason 1.0 (test)		BLEU-1	9	16	4mo ago
MR-Ben	Qwen2.5-Math-PRM-7B-PDDL-r	Math F1	56.1	10	3mo ago
ParaRev	Full DRO	WR vs. Base	63.7	8	2mo ago
109-sample (test)	Universe Routing	Accuracy	97.25	7	4mo ago
Limit Texas Hold’em		Hit Rate	2	6	4mo ago
Leduc Hold’em		Hit Rate (HR)	2	6	4mo ago
Full task set (n=45)		Overall Score	8.84	5	4mo ago
