Aggregated

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
General Language Evaluation | Aggregated MMLU, BoolQ, OpenBookQA, RTE | Average Accuracy 70.4 | 22
Feature Selection | Aggregated AL, CH, CO, EY, GE, HE, HI, HO, JA, MI, OT, YE | Rank 2.17 | 17
General Language Proficiency | Aggregated GSM8K, TruthfulQA, TriviaQA, CNN/DM, MMLU | Average Score 48.6 | 9
General Performance | Aggregated MMLU, HellaSwag, TruthfulQA, GSM8K, MATH, MBPP, HumanEval | Average Score 40.35 | 9
Disentanglement | Aggregated | InfoM 0.76 | 8
Faithfulness Diagnosticity | Aggregated SST, Ev.Inf, AG, and M.RC | Alpha Score 0.525 | 4
Instance-level search | Aggregated Mean All & Mean R1M (test) | Mean All 0.601 | 2