3 datasets

Benchmarks

Task Name	Dataset Name	SOTA Result
Time series forecasting	3 datasets averaged (test)	MAE 0.08	22
Selective Prediction	3 datasets (mean over all 21 runs)	AUROC 0.7895	16
LLM-as-a-Judge Routing	3 datasets Average (test)	Accuracy 90	12
Selective Prediction	3 datasets (Trace)	AUROC 0.7984	8
