
Aggregated

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Multitask LLM Evaluation | Aggregated MMLU, GSM8K, HumanEval | Average Accuracy 88.33 | 42 |
| General Language Evaluation | Aggregated MMLU, BoolQ, OpenBookQA, RTE | Average Accuracy 70.4 | 42 |
| Video understanding | Aggregated Average Score | Average Score 62.7 | 36 |
| Feature Selection | Aggregated AL, CH, CO, EY, GE, HE, HI, HO, JA, MI, OT, YE | Rank 2.17 | 17 |
| Overall Performance | Aggregated All Benchmarks | Average Score 40.3 | 12 |
| General Language Proficiency | Aggregated GSM8K, TruthfulQA, TriviaQA, CNN/DM, MMLU | Average Score 48.6 | 9 |
| General Performance | Aggregated MMLU, HellaSwag, TruthfulQA, GSM8K, MATH, MBPP, HumanEval | Average Score 40.35 | 9 |
| Context Compression for Question Answering | Aggregated NQ, TQA, HQA, 2Wiki, Musique | EM 34 | 8 |
| Disentanglement | Aggregated | InfoM 0.76 | 8 |
| Disentanglement | Aggregated (Shapes3D, MPI3D, Falcor3D, Isaac3D) | InfoM Score 0.65 | 5 |
| General Reasoning Efficiency | Aggregated (Sudoku, Maze, ARC, DDE) | Fp Score 3.04 | 4 |
| Faithfulness Diagnosticity | Aggregated SST, Ev.Inf, AG, and M.RC | Alpha Score 0.525 | 4 |
| Instance-level search | Aggregated Mean All & Mean R1M (test) | Mean All 0.601 | 2 |