Multi-task Evaluation

Benchmarks

Dataset Name	SOTA Method	Metric	Value	Count	Updated
PostTrainBench		AIME 25 Score	29.17	41	1mo ago
Aggregate (AIME25, AIME24, MATH, GSM8K, HumanEval+, MBPP+, MedQA, GPQA-Diamond)	Primitives-based MAS	Average Accuracy	77.6	21	4mo ago
Aggregate (GSM8K, BFCL, Spider, HumanEval)	RLSTA	Average Accuracy	79.4	20	1mo ago
Aggregated Downstream Benchmark Suite (math_500, GSM8K, IFEval, ARC, HellaSwag, MMLU, MBPP+, HumanEval+, WildGuard, HarmBench)	TA	Mean Downstream Accuracy	50.7	20	2mo ago
Aggregate All tasks (summary)	EvoRoute	Score	74.6	20	4mo ago
Fairness and Utility Suite	Self-Debias Iter2 + Self-Correction	Average Score	82.1	16	3mo ago
Average (GSM8K, HumanEval, ARC-c)	ReMix	Accuracy	60.77	13	4mo ago
Aggregated Clinical Tasks		Average Score	74.6	12	3mo ago
Aggregate (LAMBADA, HellaSwag, PIQA, ARC, WinoGrande) (various)	Mistral (Full-Attention)	Avg Accuracy	51.9	10	4mo ago
General and Medical Aggregation	Anchored Learning	Overall Average	51.8	9	2mo ago
Aggregate (HealthQA, ARC-C, PopQA, Squad1, ASQA)	AMATA 8B	Average Score	62.8	8	2mo ago
Spatial and Temporal Simulation Tasks (Episode-level)	Cortex (full harness)	Average Total Score	7.81	7	18d ago
Spatial and Temporal Simulation Tasks (Step-level)	Cortex (full harness)	Average Total Score	8.318	7	18d ago
Average (GSM8K-CoT, MATH, MBPP, HumanEval)	TAD-Q	Accuracy	51.6	7	2mo ago
Mixed Benchmark Average	phi-balancing	Average Accuracy	40.86	6	2mo ago
Average (test)		Hit Score	52.45	6	4mo ago
VLM Tasks Average		Accuracy	86.7	3	4mo ago
InternVL2-26B Task Suite Zeroing	TaLo	Persuasive Strategies Score	58.3	2	4mo ago
