
Macro Average

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| General Task Performance | Macro-average (Mathematics, Multi-Hop QA, Code Generation) | Accuracy: 69 | 21 |
| Re-identification | Macro Average (Across Datasets) | AUC: 96.9 | 18 |
| Mathematical Reasoning | Macro Average AIME2024, MATH, Minerva, Olympiad-Bench | Pass@1: 55 | 16 |
| Mathematical Reasoning | Macro Average Selected Benchmarks | Pass@1 (Avg@32): 52.8 | 14 |
| Text Classification | Macro-Average | Mean Accuracy: 82.14 | 11 |
| Regression | Macro-average SICKR-STS, STS-B, WMT_RU_EN, WMT_EN_ZH, WMT_SI_EN (test) | Pearson Correlation (r): 76.3 | 11 |
| Mathematical Reasoning | Macro-average | Avg@8: 36.6 | 10 |
| Question Answering and Reasoning | Macro-average (MMLU, MATH, GSM8K, BBH) | Cost Reduction: 46 | 8 |
| Graph-based Agent Memory Poisoning | Macro Average (PubMedQA, WebShop, ToolEmu) | Utilization (Util.): 98.4 | 5 |
| Procedural Planning | Macro Average Zero-shot | Macro Accuracy (Zero-shot): 69.7 | 4 |
| Procedural Planning | Macro Average In-domain | Macro Accuracy: 56.3 | 4 |