
Aggregated

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| General Language Evaluation | Aggregated MMLU, BoolQ, OpenBookQA, RTE | Average Accuracy: 70.4 | 42 |
| Video understanding | Aggregated Average Score | Average Score: 62.7 | 26 |
| Feature Selection | Aggregated AL, CH, CO, EY, GE, HE, HI, HO, JA, MI, OT, YE | Rank: 2.17 | 17 |
| General Language Proficiency | Aggregated GSM8K, TruthfulQA, TriviaQA, CNN/DM, MMLU | Average Score: 48.6 | 9 |
| General Performance | Aggregated MMLU, HellaSwag, TruthfulQA, GSM8K, MATH, MBPP, HumanEval | Average Score: 40.35 | 9 |
| Context Compression for Question Answering | Aggregated NQ, TQA, HQA, 2Wiki, Musique | EM: 34 | 8 |
| Disentanglement | Aggregated | InfoM: 0.76 | 8 |
| Disentanglement | Aggregated (Shapes3D, MPI3D, Falcor3D, Isaac3D) | InfoM Score: 0.65 | 5 |
| Faithfulness Diagnosticity | Aggregated SST, Ev.Inf, AG, and M.RC | Alpha Score: 0.525 | 4 |
| Instance-level search | Aggregated Mean All & Mean R1M (test) | Mean All: 0.601 | 2 |