
Combined Suite

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Holistic Evaluation | Combined Suite General Reasoning Perception Text | Text Average: 76.3 | 13 |
| LLM Alignment | Combined Suite Setup 3 | Average Percentage Score: 54.38 | 9 |
| Overall Performance Evaluation | Combined Suite (MME, MMStar, SQA, RealWorldQA, MMMU, MMMU-P, VisuLogic, LogicVista, CRPE, POPE, HallBench) | Average Score: 43.94 | 4 |
| General Language Modeling | Combined Suite (HS, PIQA, SIQA, Wino, MMLU, NQ, TQA, ARC-C, ARC-E, OBQA, BoolQ, DROP, BBH-LB, GSM8K) | Accuracy: 57.8 | 4 |
| Knowledge-Preserved Adaptation | Combined Suite TriviaQA, NQ open, WebQS, HumanEval, MBPP | Average Score: 21.91 | 4 |