Seg-MoE: Multi-Resolution Segment-wise Mixture-of-Experts for Time Series Forecasting Transformers
About
Transformer-based models have recently made significant advances in time-series forecasting accuracy, but even these architectures struggle to scale efficiently while capturing long-term temporal dynamics. Mixture-of-Experts (MoE) layers are a proven solution to scaling problems in natural language processing. However, existing MoE approaches for time-series forecasting rely on token-wise routing mechanisms, which may fail to exploit the natural locality and continuity of temporal data. In this work, we introduce Seg-MoE, a sparse MoE design that routes and processes contiguous segments of time steps rather than making an independent expert decision for each token. Routing whole segments allows each expert to model intra-segment interactions directly, which aligns naturally with inherent temporal patterns. We integrate Seg-MoE layers into a time-series Transformer and evaluate it on multiple multivariate long-term forecasting benchmarks. Seg-MoE consistently achieves state-of-the-art forecasting accuracy across almost all prediction horizons, outperforming both dense Transformers and prior token-wise MoE models. Comprehensive ablation studies confirm that segment-level routing is the key factor driving these gains. Our results show that aligning the MoE routing granularity with the inherent structure of time series provides a powerful, yet previously underexplored, inductive bias, opening new avenues for conditionally sparse architectures in sequential data modeling.
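To make the contrast with token-wise routing concrete, the following is a minimal sketch of segment-level routing, not the authors' implementation. It assumes a single fixed segment length (the multi-resolution aspect is not modeled), a mean-pooled segment summary as the router input, top-k selection with a softmax over the selected scores, and toy experts that mix each token with its segment mean. The names `route_segments`, `make_expert`, and `seg_moe_layer` are hypothetical.

```python
import math
import random


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_segments(tokens, router_weights, segment_len, top_k):
    """Make one routing decision per contiguous segment (not per token)."""
    assert len(tokens) % segment_len == 0, "sketch assumes T divisible by segment_len"
    decisions = []
    for start in range(0, len(tokens), segment_len):
        segment = tokens[start:start + segment_len]
        summary = mean_vector(segment)  # assumption: mean-pooled router input
        scores = [sum(w * x for w, x in zip(row, summary)) for row in router_weights]
        chosen = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:top_k]
        gates = softmax([scores[e] for e in chosen])  # renormalise over chosen experts
        decisions.append(list(zip(chosen, gates)))
    return decisions


def make_expert(scale):
    """Hypothetical expert: adds a scaled segment mean to every token,
    so its output depends on the whole segment, not on one token alone."""
    def expert(segment):
        ctx = mean_vector(segment)
        return [[x + scale * c for x, c in zip(tok, ctx)] for tok in segment]
    return expert


def seg_moe_layer(tokens, router_weights, experts, segment_len, top_k=2):
    """Apply the selected experts to each segment and mix outputs by gate weight."""
    decisions = route_segments(tokens, router_weights, segment_len, top_k)
    output = []
    for seg_idx, picks in enumerate(decisions):
        start = seg_idx * segment_len
        segment = tokens[start:start + segment_len]
        mixed = [[0.0] * len(segment[0]) for _ in segment]
        for expert_idx, gate in picks:
            expert_out = experts[expert_idx](segment)
            for t, tok in enumerate(expert_out):
                for d, value in enumerate(tok):
                    mixed[t][d] += gate * value
        output.extend(mixed)
    return output, decisions


rng = random.Random(0)
T, D, E, S = 12, 4, 4, 3  # time steps, model dim, experts, segment length
tokens = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
router_weights = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(E)]
experts = [make_expert(0.5 * (e + 1)) for e in range(E)]

output, decisions = seg_moe_layer(tokens, router_weights, experts, S, top_k=2)
print(len(output), len(decisions))  # 12 output tokens, 4 routing decisions
```

With 12 time steps and a segment length of 3, the router makes 4 decisions instead of 12, and every token in a segment is handled by the same experts. A token-wise MoE corresponds to the special case `segment_len=1`, where no expert ever sees neighbouring time steps.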
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multivariate Time-series Forecasting | ETTh2 (test) | MSE 0.272 | 171 |
| Multivariate Time-series Forecasting | ETTh1 (test) | MSE 0.343 | 134 |
| Multivariate Time-series Forecasting | Weather (test) | MSE 0.223 | 124 |
| Multivariate Time-series Forecasting | ECL (test) | MSE 0.164 | 77 |
| Multivariate Time-series Forecasting | ETTm1 (test) | MSE 0.343 | 67 |
| Multivariate long-term forecasting | ETTh1 T=96 (test) | MSE 0.343 | 48 |
| Multivariate Time-series Forecasting | Traffic (test) | -- | 36 |
| Multivariate Time-series Forecasting | ETTm2 (test) | MSE 0.257 | 35 |
| Long-term multivariate forecasting | ECL horizon 96 (test) | MSE 0.132 | 22 |
| Long-term multivariate forecasting | Weather Avg. (test) | MSE 0.223 | 5 |