VarDrop: Enhancing Training Efficiency by Reducing Variate Redundancy in Periodic Time Series Forecasting
About
Variate tokenization, which independently embeds each variate as separate tokens, has achieved remarkable improvements in multivariate time series forecasting. However, employing self-attention with variate tokens incurs a quadratic computational cost with respect to the number of variates, thus limiting its training efficiency for large-scale applications. To address this issue, we propose VarDrop, a simple yet efficient strategy that reduces the token usage by omitting redundant variate tokens during training. VarDrop adaptively excludes redundant tokens within a given batch, thereby reducing the number of tokens used for dot-product attention while preserving essential information. Specifically, we introduce k-dominant frequency hashing (k-DFH), which utilizes the ranked dominant frequencies in the frequency domain as a hash value to efficiently group variate tokens exhibiting similar periodic behaviors. Then, only representative tokens in each group are sampled through stratified sampling. By performing sparse attention with these selected tokens, the computational cost of scaled dot-product attention is significantly alleviated. Experiments conducted on public benchmark datasets demonstrate that VarDrop outperforms existing efficient baselines.
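The grouping and sampling steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `k_dominant_frequency_hash` and `select_representatives` are hypothetical, and zeroing the DC component and drawing exactly one token per group are assumptions made for this illustration.

```python
import numpy as np


def k_dominant_frequency_hash(x, k=3):
    """Hash each variate by its k strongest frequency bins, in ranked order.

    x has shape (num_variates, seq_len). Returns one tuple of k bin indices
    per variate; variates with equal tuples share a group.
    """
    amplitude = np.abs(np.fft.rfft(x, axis=-1))
    # Assumption: ignore the DC bin so a constant offset does not change the hash.
    amplitude[:, 0] = 0.0
    # Indices of the k largest amplitudes, strongest first.
    top_k = np.argsort(-amplitude, axis=-1, kind="stable")[:, :k]
    return [tuple(int(i) for i in row) for row in top_k]


def select_representatives(x, k=3, rng=None):
    """Group variates by hash, then sample one representative per group."""
    rng = np.random.default_rng() if rng is None else rng
    groups = {}
    for idx, key in enumerate(k_dominant_frequency_hash(x, k)):
        groups.setdefault(key, []).append(idx)
    # Assumption: stratified sampling here means one uniform draw per group.
    return sorted(int(rng.choice(members)) for members in groups.values())


if __name__ == "__main__":
    t = np.arange(96)
    a = np.sin(2 * np.pi * 4 * t / 96) + 0.5 * np.sin(2 * np.pi * 8 * t / 96)
    b = 2.0 * a + 1.0  # same periodic structure, different scale and offset
    c = np.sin(2 * np.pi * 12 * t / 96) + 0.5 * np.sin(2 * np.pi * 3 * t / 96)
    x = np.stack([a, b, c])
    print(k_dominant_frequency_hash(x, k=2))
    print(select_representatives(x, k=2, rng=np.random.default_rng(0)))
```

In this toy input, the first two variates share the same dominant frequencies and fall into one group, so only one of them is kept. If m of N variate tokens survive, attention over the kept tokens scales with m² rather than N².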
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Multivariate Time-series Forecasting | Weather | MSE | 0.261 | 340 |
| Multivariate Time-series Forecasting | Traffic | MSE | 0.396 | 264 |
| Multivariate long-term time series forecasting | Solar Energy | MSE | 0.236 | 79 |
| Multivariate Time-series Forecasting | Electricity | MAE | 0.245 | 73 |
| Time Series Forecasting | Electricity (test) | Memory Footprint (GB) | 2.22 | 6 |
| Multivariate Time-series Forecasting | Electricity, Traffic, Weather, Solar-Energy Aggregate | Overall MSE | 0.277 | 6 |
| Time Series Forecasting | Electricity (test) | Training Time (ms/iteration) | 30.8 | 5 |
| Time Series Forecasting | Traffic (test) | Training Time (ms) | 72.7 | 5 |
| Time Series Forecasting | Weather (test) | Training Time (ms/iteration) | 12.6 | 5 |
| Time Series Forecasting | Solar-Energy (test) | Training Time (ms/iteration) | 15.9 | 5 |