Composite

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Image Captioning Evaluation | Composite | Kendall-c Tau_c: 66 | 92 |
| Property Prediction | Composite | RMSE (Yield): 139.532 | 24 |
| Caption-level correlation with human judgment | Composite (test) | Kendall's Tau: 0.6 | 21 |
| Correlation with human judgments | Composite (test) | Kendall's Tau-c: 57.6 | 18 |
| Image Captioning Evaluation | COMPOSITE (COM) (test) | Kendall's tau-c: 64.2 | 17 |
| Correlation with human judgment | Composite 1 (test) | Kendall Tau-c: 57.3 | 15 |
| Agent & Alignment | Composite IFEval-strict-prompt, BFCL v3, CodeIF-Bench, Nexus FC | IFEval Strict Prompt Score: 86.9 | 4 |
| Math | Composite (GSM8K, MATH, OlympiadBench, AIME 2025, HARDMath2, Omni-MATH, GSM-Plus, CMATH) | GSM8K: 94.62 | 4 |
| Coding | Composite CRUXEval-O, MBPP, MBPP+, MultiPL-E, HumanEval, HumanEval+, HumanEvalFix, HumanEval-cn, BigCodeBench-Full, LiveCodeBench, Aider, BIRD-SQL, Spider | CRUXEval-O Score: 76.12 | 4 |
| Reasoning | Composite (BIG-Bench Hard, BIG-Bench Extra Hard, bbh-zh, MuSR, ZebraLogic, PrOntoQA, PIQA, OCNLI, HellaSwag, KOR-Bench, DROP, SQuAD 2.0) | BBH: 83.7 | 4 |
| Knowledge Evaluation | Composite (MMLU, MMLU-Pro, CMMLU, C-EVAL, GAOKAO-Bench, ARC-c, GPQA, SciBench, PHYBench, TriviaQA) | Overall Average Score: 65.77 | 4 |