TimeDistill: Efficient Long-Term Time Series Forecasting with MLP via Cross-Architecture Distillation

About

Transformer-based and CNN-based methods demonstrate strong performance in long-term time series forecasting. However, their high computational and storage requirements can hinder large-scale deployment. To address this limitation, we propose combining a lightweight MLP with advanced architectures using knowledge distillation (KD). Our preliminary study reveals that different models capture complementary patterns, particularly multi-scale and multi-period patterns in the temporal and frequency domains. Based on this observation, we introduce TimeDistill, a cross-architecture KD framework that transfers these patterns from teacher models (e.g., Transformers, CNNs) to an MLP. Additionally, we provide a theoretical analysis demonstrating that our KD approach can be interpreted as a specialized form of mixup data augmentation. TimeDistill improves MLP performance by up to 18.6%, surpassing teacher models on eight datasets. It also achieves up to 7× faster inference and requires 130× fewer parameters. Furthermore, we conduct extensive evaluations to highlight the versatility and effectiveness of TimeDistill.
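To make the idea of transferring multi-scale and multi-period patterns more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the choice of average pooling for scales, the softmax-normalised amplitude spectrum for periods, and the weights `alpha` and `beta` are all assumptions made for illustration.

```python
import cmath
import math


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def avg_pool(series, scale):
    """Downsample by averaging non-overlapping windows of length `scale`.
    Trailing samples that do not fill a window are dropped."""
    n = len(series) // scale
    return [sum(series[i * scale:(i + 1) * scale]) / scale for i in range(n)]


def multi_scale_loss(student, teacher, scales=(1, 2, 4)):
    """Temporal-domain term (assumed form): MSE averaged over several resolutions."""
    return sum(mse(avg_pool(student, s), avg_pool(teacher, s)) for s in scales) / len(scales)


def amplitude_spectrum(series):
    """Magnitudes of the one-sided DFT with the DC term dropped (naive O(n^2) DFT)."""
    n = len(series)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(series)))
        for k in range(1, n // 2 + 1)
    ]


def softmax(values, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    z = sum(exps)
    return [e / z for e in exps]


def multi_period_loss(student, teacher, temperature=1.0, eps=1e-12):
    """Frequency-domain term (assumed form): KL(teacher || student) between
    softmax-normalised amplitude spectra."""
    p = softmax(amplitude_spectrum(teacher), temperature)
    q = softmax(amplitude_spectrum(student), temperature)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distill_loss(student, teacher, target, alpha=0.5, beta=0.5):
    """Supervised MSE plus the two hypothetical distillation terms."""
    return (
        mse(student, target)
        + alpha * multi_scale_loss(student, teacher)
        + beta * multi_period_loss(student, teacher)
    )


# Toy example: the teacher predicts a sine with period 8, the student predicts zeros.
teacher_pred = [math.sin(2 * math.pi * t / 8) for t in range(16)]
student_pred = [0.0] * 16
loss = distill_loss(student_pred, teacher_pred, target=teacher_pred)
```

In this sketch both distillation terms are exactly zero when the student reproduces the teacher, and positive in the toy example, where the student misses the teacher's dominant period. A real training setup would compute the same quantities on batched tensors with an FFT routine and backpropagate through the student only.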

Juntong Ni, Zewen Liu, Shiyu Wang, Ming Jin, Wei Jin • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Long-term time-series forecasting | ETTh1 | MAE 0.441 | 575 |
| Long-term time-series forecasting | Weather | MSE 0.221 | 525 |
| Long-term time-series forecasting | ETTm1 | MSE 0.348 | 461 |
| Long-term time-series forecasting | ETTh2 | MSE 0.345 | 461 |
| Long-term time-series forecasting | ETTm2 | MSE 0.25 | 455 |
| Long-term time-series forecasting | Traffic | MSE 0.387 | 427 |
| Long-term time-series forecasting | solar | MSE 0.184 | 66 |
| Long-term time-series forecasting | Electricity | MSE 0.157 | 22 |
| Long-term forecasting | Weather horizon 96 | MSE 0.145 | 21 |