
Overall Evaluation Suite

Benchmarks

Task Name                                     Dataset Name              SOTA Result               Trend
General Benchmark Aggregation                 Overall Evaluation Suite  Math Average 59.8         21
Aggregated Programming Capability Evaluation  Overall Evaluation Suite  Macro Average Score 64.2  10
General Language Modeling                     Overall Evaluation Suite  Average Score 73.6        4