General Evaluation

Benchmarks

Dataset Name	SOTA Method	Metric	Value
Aggregate Benchmarks	Gemini 2.5 Pro+ASP	Average Score	93.9	37	2mo ago
AGIEval		Accuracy	70.22	29	4mo ago
Average Across all benchmarks	BASTION	Speedup	6.9	28	1mo ago
UltraFeedback Aggregate	LP-SFT	Overall Average Score	59.59	18	18d ago
LiveBench	phi-balancing	Accuracy	46.83	15	2mo ago
All Benchmarks	RESMERGE	Overall Average Score	53.74	12	1mo ago
MM-VET	ECSO	REC	39.5	12	4mo ago
Aggregate Suite PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c		Average Score	69	10	4mo ago
Reasoning, Knowledge, and Biomedicine combined datasets (test)	Reasoning	Average Score	60.47	9	4mo ago
RWQA	CoVT-7B	Score	71.8	8	1mo ago
Downstream Suite	DCDM (MoE)	Average Score	39.38	8	2mo ago
ChartQA	DeepLatent-RL-7B	Score	86.4	7	1mo ago
Visual Probe-H	DeepLatent-RL-7B	Score	38.7	7	1mo ago
Average Downstream Benchmark Suite	DoGraph	Average Accuracy	37.9	7	3mo ago
Datacomp small (38 tasks)	Final Self-Filtered 30% Data Mix	Average Score	19.7	6	1mo ago
LiveBench 1125	General Teacher	Score	52.1	6	1mo ago
Instruction Tuning Suite (BIG-bench Hard, MMLU, TyDi QA, MGSM)	Flan-PaLM 2 (L)	Average Score	74.1	4	4mo ago
ExpSuite-Static Overall	ExpGraph	Average Score	78.75	2	1mo ago
