Aggregated Benchmarks

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
General Performance | Aggregated Benchmarks | Overall Average: 49.76 | 22
Quantization Performance Summary | Aggregated Benchmarks HellaSwag, MMLU, Arc-C, MATH-500 | Average Score: 1.014 | 22
Meta-Evaluation | Aggregated Benchmarks (AIME, ARC, GSM8K, HE, MMLU, IT, RU, BFCL) | Overall Average Rank: 2.4 | 16
General Multimodal Understanding | Aggregated Benchmarks | Average Score: 71 | 13
Summary Evaluation | Aggregated Benchmarks | Average Score (3-item avg): 42.2 | 12
Reward Modeling | Aggregated Benchmarks Macro | Average Score (excl. MM-RB, VL-RB): 74.74 | 12
General Language Evaluation | Aggregated Benchmarks | Average Score: 0.7449 | 10
Overall Language Model Evaluation | Aggregated Benchmarks STEM Code IF General | Average Score: 61.7 | 7
General Multitask Evaluation | Aggregated Benchmarks Math500, GPQA, HumanEval, MBPP, AE2 LC | Average Score: 40.7 | 5