Combined

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Relative Robustness Analysis | Combined Past Tense, OR-Bench, MMLU | R-Score 78.9 | 36 |
| Reasoning | Combined 37 Tasks (test) | Accuracy 72.4 | 28 |
| Reasoning | Combined 107 Tasks (train) | Accuracy 68.8 | 28 |
| Question Answering | Combined 7 Datasets | Average Score 45 | 18 |
| Harmful prompt detection | Combined Average | F1 Score (Combined Average) 90.18 | 17 |
| Question Answering | Combined NQ, TriviaQA, PopQA, HotpotQA, 2WikiMQA, MuSiQue, Bamboogle | Total Score 280.3 | 15 |
| All-in-One Image Restoration | Combined (Deraining, Desnowing, Dehazing) | PSNR 34.02 | 13 |
| Subject-Level Detection | Combined (ADFTD, BrainLat, AD-Auditory, ADFSU, APAVA) | Accuracy 78.23 | 12 |
| Segment-Level Classification | Combined ADFTD, BrainLat, AD-Auditory, ADFSU, APAVA | Accuracy 68.03 | 12 |
| Bayesian neural network regression | Combined (test) | RMSE 3.925 | 12 |
| General Video Understanding | Combined (VideoMME, LVBench, LongVideoBench, EgoSchema, MLVU) | Average Score 64.9 | 11 |
| Negative Concept Suppression | Combined LLM-generated + COCO-derived | Suppression Rate 85.25 | 10 |
| Multi-turn attack detection | Combined LMSYS, SafeDialBench, Synthetic (held-out) | Detection Accuracy 99 | 10 |
| Malicious Prompt Detection | Combined All Datasets (test) | ASR 4.5 | 6 |
| Language Understanding and Reasoning | Combined (GSM8k, MATH500, MAWPS, SVAMP, AQuA, GLUE, CSQA, OBQA) | Average Score 72.94 | 5 |
| Probabilistic Calibration | Combined 20K labeled samples | Brier Score 0.0759 | 5 |
| Classification | Combined (label-stratified) | AUROC 0.986 | 4 |
| Zero-shot Language Understanding | Combined Zero-shot | Average Accuracy 63.71 | 4 |
| Visualization Generation | Combined (C) | Compilation Success Rate 100 | 4 |
| Brain Tumor Classification | Combined 4 datasets | Accuracy 99.5 | 3 |
| Data-to-text generation | Combined | FE 8.05 | 3 |
| Shadow Detection | Combined Dataset | Testing Time (hours) 0.55 | 3 |
| Landmark Detection | Combined | MRE 1.02 | 2 |
| Attack Detection | Combined (label-stratified) | AUROC 97.1 | 1 |
| Ranking Method Evaluation | Combined AIME'24 AIME'25 HMMT'25 BrUMO'25 | Mean Kendall's tau_b Correlation 0.962 | 1 |
Showing 25 of 26 rows