SOTA Multi-task Evaluation benchmarks and papers with code | Wizwand
Multi-task Evaluation
Benchmarks
| Dataset Name | SOTA Method | Metric | Value | Results | Last Updated |
| --- | --- | --- | --- | --- | --- |
| PostTrainBench | Official Instruct Models | AIME 25 Score | 29.17 | 41 | 1d ago |
| Aggregate (AIME25, AIME24, MATH, GSM8K, HumanEval+, MBPP+, MedQA, GPQA-Diamond) | Primitives-based MAS | Average Accuracy | 77.6 | 21 | 3mo ago |
| Aggregate (GSM8K, BFCL, Spider, HumanEval) | RLSTA | Average Accuracy | 79.4 | 20 | 7d ago |
| Aggregated Downstream Benchmark Suite (math_500, GSM8K, IFEval, ARC, HellaSwag, MMLU, MBPP+, HumanEval+, WildGuard, HarmBench) | TA | Mean Downstream Accuracy | 50.7 | 20 | 21d ago |
| Aggregate All tasks (summary) | EvoRoute | Score | 74.6 | 20 | 3mo ago |
| Fairness and Utility Suite | Self-Debias Iter2 + Self-Correction | Average Score | 82.1 | 16 | 1mo ago |
| Average GSM8K, HumanEval, ARC-c | ReMix | Accuracy | 60.77 | 13 | 2mo ago |
| Aggregated Clinical Tasks | Full FT | Average Score | 74.6 | 12 | 1mo ago |
| Aggregate (LAMBADA, HellaSwag, PIQA, ARC, WinoGrande) (various) | Mistral (Full-Attention) | Avg Accuracy | 51.9 | 10 | 3mo ago |
| General and Medical Aggregation | Anchored Learning | Overall Average | 51.8 | 9 | 27d ago |
| Aggregate (HealthQA, ARC-C, PopQA, Squad1, ASQA) | AMATA 8B | Average Score | 62.8 | 8 | 15d ago |
| Average (GSM8K-CoT, MATH, MBPP, HumanEval) | TAD-Q | Accuracy | 51.6 | 7 | 22d ago |
| Mixed Benchmark Average | phi-balancing | Average Accuracy | 40.86 | 6 | 16d ago |
| Average (test) | NA-SFT | Hit Score | 52.45 | 6 | 3mo ago |
| VLM Tasks Average | Default | Accuracy | 86.7 | 3 | 2mo ago |
| InternVL2-26B Task Suite Zeroing | TaLo | Persuasive Strategies Score | 58.3 | 2 | 3mo ago |
Showing 16 of 16 rows