
Average of Benchmarks

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Model Merging | Average of 8 benchmarks | Average Accuracy: 52.79 | 72 |
| Best-of-N Reranking | Average of 7 benchmarks (including AIME24, LeetCode) (test) | Average Accuracy: 52 | 42 |
| Knowledge Assessment and Commonsense Reasoning | Average of 8 Benchmarks (ARC-C, ARC-E, BoolQ, HellaS, LamOp, Piqa, WinoG, MMLU) | Accuracy: 72.99 | 10 |