
In-distribution

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Harmfulness Detection | Full in-distribution (test) | AUROC: 0.964 | 63 |
| Error Detection | In-distribution (test) | AUC: 0.8916 | 40 |
| Mathematical Reasoning | In-Distribution Avg | Average Score: 45.6 | 29 |
| Debiasing Effectiveness | In-Distribution (ID) | Mean Effectiveness Score (ID): 10.2 | 16 |
| Object type prediction | In-Distribution (ID) | Accuracy (ID): 100 | 9 |
| Reasoning step reduction | In-Distribution 5K corpus (test) | Savings Rate: 47.5 | 9 |
| Point Tracking | In-distribution | Avg Displacement Error: 57.4 | 6 |
| Prompt Injection Detection | In-distribution (ID) (test) | Macro F1 Score: 95.41 | 5 |
| Text-to-Speech | In-distribution ID (test) | MOS: 3.87 | 5 |
| Metasurface inverse design | In-Distribution (test) | SG: 74 | 2 |