MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion
About
The combination of Spiking Neural Networks (SNNs) with Vision Transformer architectures has attracted significant attention because of its potential for energy-efficient, high-performance computing. However, a substantial performance gap still exists between SNN-based and ANN-based transformer architectures. Existing methods propose spiking self-attention mechanisms that combine successfully with SNNs, but their overall architectures are bottlenecked in extracting features effectively at different image scales. In this paper, we address this issue and propose MSVIT, a novel spike-driven Transformer architecture that is the first to use multi-scale spiking attention (MSSA) to enhance the capabilities of spiking attention blocks. We validate our approach on several mainstream datasets. The experimental results show that MSVIT outperforms existing SNN-based models, making it a state-of-the-art solution among SNN-transformer architectures. The code is available at https://github.com/Nanhu-AI-Lab/MSViT.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | CIFAR100 | Accuracy | 81.98 | 301 |
| Image Classification | CIFAR10 | Accuracy (%) | 96.53 | 282 |
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy | 82.96 | 18 |
| Event-based Image Classification | CIFAR10-DVS | Accuracy | 84 | 8 |
| Gesture Recognition | DVS128 | Accuracy | 98.8 | 7 |