Average of 9 tasks

Benchmarks

Task Name	Dataset Name	SOTA Result	Trend
Instruction following and reasoning	Average of 9 tasks (DollyEval, VicunaEval, GSM8K, MATH, AIME2024, HumanEval, MBPP, LiveCodeBench, GPQA-D)	Average Performance: 31.19		9
