Shared

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| LLM Evaluation | Shared (evaluation) | Tie-aware Accuracy: 78 | 10 |
| Multi-judge evaluation | Shared 500-prompt sample | Global Correlation (r): 0.87 | 5 |
| Calibration and Discrimination | Shared pooled aggregation (test) | Brier Score (BS): 0.1 | 4 |