MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection

About

Recent advances in deep learning have mainly relied on Transformers due to their data dependency and ability to learn at scale. The attention module in these architectures, however, has time and space complexity that is quadratic in the input size, which limits its scalability for long-sequence modeling. Despite recent attempts to design efficient and effective architecture backbones for multi-dimensional data, such as images and multivariate time series, existing models are either data independent or fail to allow inter- and intra-dimension communication. Recently, State Space Models (SSMs), and more specifically Selective State Space Models, with efficient hardware-aware implementations, have shown promising potential for long-sequence modeling. Motivated by the success of SSMs, we present MambaMixer, a new architecture with data-dependent weights that uses a dual selection mechanism across tokens and channels, called Selective Token and Channel Mixer. MambaMixer connects selective mixers using a weighted averaging mechanism, allowing layers to have direct access to early features. As a proof of concept, we design Vision MambaMixer (ViM2) and Time Series MambaMixer (TSM2) architectures based on the MambaMixer block and explore their performance in various vision and time series forecasting tasks. Our results underline the importance of selective mixing across both tokens and channels. In ImageNet classification, object detection, and semantic segmentation tasks, ViM2 achieves competitive performance with well-established vision models and outperforms SSM-based vision models. In time series forecasting, TSM2 achieves outstanding performance compared to state-of-the-art methods at a significantly lower computational cost. These results show that while Transformers, cross-channel attention, and MLPs are each sufficient for good performance in time series forecasting, none of them is necessary.
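The data flow described in the abstract (mixing along tokens, then along channels, with each mixer reading a weighted average of all earlier outputs) can be illustrated with a toy sketch. This is not the authors' implementation: `selective_scan` is a hypothetical gated recurrence standing in for a selective SSM, the averaging weights are fixed to uniform values rather than learned, and all function names are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def selective_scan(seq):
    """Toy input-dependent recurrence: h_t = g_t * h_{t-1} + (1 - g_t) * x_t.

    The gate g_t = sigmoid(x_t) depends on the input, which is the only
    property this stand-in shares with a selective SSM (assumption).
    """
    h, out = 0.0, []
    for x in seq:
        g = sigmoid(x)
        h = g * h + (1.0 - g) * x
        out.append(h)
    return out


def token_mixer(x):
    """Scan along the token axis, independently for each channel."""
    cols = [selective_scan(list(col)) for col in zip(*x)]
    return [list(row) for row in zip(*cols)]


def channel_mixer(x):
    """Scan along the channel axis, independently for each token."""
    return [selective_scan(row) for row in x]


def weighted_average(tensors, weights):
    """Elementwise weighted average of equally shaped 2-D lists."""
    total = sum(weights)
    n_tok, n_ch = len(tensors[0]), len(tensors[0][0])
    return [
        [sum(w * t[i][j] for w, t in zip(weights, tensors)) / total
         for j in range(n_ch)]
        for i in range(n_tok)
    ]


def mixer_stack(x, num_blocks):
    """Stack of token/channel mixers with access to all earlier outputs.

    x is a list of tokens, each token a list of channel values.
    Uniform weights are a placeholder; a real model would learn them.
    """
    outputs = [x]
    for _ in range(num_blocks):
        tok_in = weighted_average(outputs, [1.0] * len(outputs))
        outputs.append(token_mixer(tok_in))
        ch_in = weighted_average(outputs, [1.0] * len(outputs))
        outputs.append(channel_mixer(ch_in))
    return outputs[-1]


x = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # 2 tokens, 3 channels
y = mixer_stack(x, num_blocks=2)
```

The output keeps the input shape (tokens by channels), so blocks can be stacked. Because every mixer input averages over `outputs`, later blocks still see the raw input `x`, which is the sense in which layers have direct access to early features.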

Ali Behrouz, Michele Santacatterina, Ramin Zabih • 2024

Related benchmarks

Task	Dataset	Result
Long-term time-series forecasting	Weather	MSE 0.239	525
Long-term time-series forecasting	Traffic	MSE 0.42	427
Long-term forecasting	ETTm1	MSE 0.361	422
Long-term forecasting	ETTh1	MSE 0.403	409
Long-term forecasting	ETTm2	MSE 0.267	350
Long-term forecasting	ETTh2	MSE 0.333	310
Long-term time-series forecasting	ECL	MSE 0.169	163
Long-term forecasting	Exchange	MSE 0.443	73
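The results above are reported as mean squared error (MSE), where lower is better. As a reminder of what the metric measures, here is a minimal sketch of its computation. The function name `mse` and the sample values are illustrative assumptions and are not taken from the benchmarks.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only: one forecast is off by 2, the others are exact.
error = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Forecasting benchmarks of this kind typically average the error over all forecast steps and all variables, so a single large miss is penalized quadratically.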
