
Combined

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Relative Robustness Analysis | Combined Past Tense, OR-Bench, MMLU | R-Score 78.9 | 36 |
| Reasoning | Combined 37 Tasks (test) | Accuracy 72.4 | 28 |
| Reasoning | Combined 107 Tasks (train) | Accuracy 68.8 | 28 |
| Question Answering | Combined 7 Datasets | Average Score 45 | 18 |
| Question Answering | Combined NQ, TriviaQA, PopQA, HotpotQA, 2WikiMQA, MuSiQue, Bamboogle | Total Score 280.3 | 15 |
| All-in-One Image Restoration | Combined (Deraining, Desnowing, Dehazing) | PSNR 34.02 | 13 |
| Bayesian neural network regression | Combined (test) | RMSE 3.939 | 6 |
| Malicious Prompt Detection | Combined All Datasets (test) | ASR 4.5 | 6 |
| Language Understanding and Reasoning | Combined (GSM8k, MATH500, MAWPS, SVAMP, AQuA, GLUE, CSQA, OBQA) | Average Score 72.94 | 5 |
| Probabilistic Calibration | Combined 20K labeled samples | Brier Score 0.0759 | 5 |
| Visualization Generation | Combined (C) | Compilation Success Rate 100 | 4 |
| Data-to-text generation | Combined | FE 8.05 | 3 |
| Shadow Detection | Combined Dataset | Testing Time (hours) 0.55 | 3 |
| Ranking Method Evaluation | Combined AIME'24 AIME'25 HMMT'25 BrUMO'25 | Mean Kendall's tau_b Correlation 0.962 | 1 |
| Ranking Correlation Analysis | Combined AIME'24 AIME'25 HMMT'25 BrUMO'25 | Kendall's tau_b (vs Gold Standard) 0.865 | 1 |