
fev-bench: A Realistic Benchmark for Time Series Forecasting

About

Benchmark quality is critical for meaningful evaluation and sustained progress in time series forecasting, particularly with the rise of pretrained models. Existing benchmarks often have limited domain coverage or overlook real-world settings such as tasks with covariates. Their aggregation procedures frequently lack statistical rigor, making it unclear whether observed performance differences reflect true improvements or random variation. Many benchmarks lack consistent evaluation infrastructure or are too rigid for integration into existing pipelines. To address these gaps, we propose fev-bench, a benchmark of 100 forecasting tasks across seven domains, including 46 with covariates. Supporting the benchmark, we introduce fev, a lightweight Python library for forecasting evaluation emphasizing reproducibility and integration with existing workflows. Using fev, fev-bench employs principled aggregation with bootstrapped confidence intervals to report performance along two dimensions: win rates and skill scores. We report results on fev-bench for pretrained, statistical, and baseline models and identify promising future research directions.
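The aggregation idea in the abstract (win rates and skill scores, each reported with a bootstrapped confidence interval) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the fev API: the function names are hypothetical, the win rate is taken as the fraction of tasks where a model beats a baseline (ties counted as half a win), the skill score is taken as one minus the geometric mean of per-task error ratios, and the interval is a simple percentile bootstrap over tasks. The exact definitions used by fev-bench may differ.

```python
import math
import random


def win_rate(model_errors, baseline_errors):
    """Fraction of tasks where the model's error is lower; a tie counts as half a win."""
    wins = 0.0
    for m, b in zip(model_errors, baseline_errors):
        if m < b:
            wins += 1.0
        elif m == b:
            wins += 0.5
    return wins / len(model_errors)


def skill_score(model_errors, baseline_errors):
    """One minus the geometric mean of per-task error ratios (model / baseline)."""
    log_ratios = [math.log(m / b) for m, b in zip(model_errors, baseline_errors)]
    return 1.0 - math.exp(sum(log_ratios) / len(log_ratios))


def bootstrap_ci(metric, model_errors, baseline_errors,
                 n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling tasks with replacement."""
    rng = random.Random(seed)
    n = len(model_errors)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([model_errors[i] for i in idx],
                            [baseline_errors[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Hypothetical per-task errors (made-up numbers, not results from the paper).
model_errors = [0.8, 1.2, 0.5, 0.9, 1.0]
baseline_errors = [1.0, 1.0, 1.0, 1.0, 1.0]

wr = win_rate(model_errors, baseline_errors)
ss = skill_score(model_errors, baseline_errors)
wr_ci = bootstrap_ci(win_rate, model_errors, baseline_errors)
ss_ci = bootstrap_ci(skill_score, model_errors, baseline_errors)
print(f"win rate = {wr:.3f}, CI = {wr_ci}")
print(f"skill score = {ss:.3f}, CI = {ss_ci}")
```

Resampling whole tasks, rather than individual series, treats the task as the unit of evidence, so a wide interval signals that a difference between two models may reflect the choice of tasks rather than a true improvement.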

Oleksandr Shchur, Abdul Fatir Ansari, Caner Turkmen, Lorenzo Stella, Nick Erickson, Pablo Guerron, Michael Bohlke-Schneider, Yuyang Wang • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Time Series Forecasting | fev-bench | Average Win Rate | 64.1 | 25
Point forecasting | fev-bench v0.7.0 | Win Rate | 58.5 | 19
Probabilistic Marginal Forecasting | fev-bench marginal forecasting v0.7.0 | Win Rate | 64.1 | 19
Time Series Forecasting | fev-bench 100 tasks | Win Rate | 58.5 | 12
