3 datasets

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| Time series forecasting | 3 datasets averaged (test) | MAE 0.08 | 22 |
| Selective Prediction | 3 datasets (mean over all 21 runs) | AUROC 0.7895 | 16 |
| LLM-as-a-Judge Routing | 3 datasets Average (test) | Accuracy 90 | 12 |
| Selective Prediction | 3 datasets (Trace) | AUROC 0.7984 | 8 |