MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion

About

The combination of Spiking Neural Networks (SNNs) with Vision Transformer architectures has attracted significant attention for its potential to deliver energy-efficient, high-performance computing. However, a substantial performance gap remains between SNN-based and ANN-based Transformer architectures. Existing methods propose spiking self-attention mechanisms that combine successfully with SNNs, but their overall architectures struggle to extract features effectively at different image scales. In this paper, we address this issue and propose MSVIT, a novel spike-driven Transformer architecture that is the first to use multi-scale spiking attention (MSSA) to enhance the capabilities of spiking attention blocks. We validate our approach on several mainstream datasets. The experimental results show that MSVIT outperforms existing SNN-based models, making it a state-of-the-art solution among SNN-Transformer architectures. The code is available at https://github.com/Nanhu-AI-Lab/MSViT.

Wei Hua, Chenlin Zhou, Jibin Wu, Yansong Chua, Yangyang Shu • 2025

Related benchmarks

Task                              Dataset            Result                 Rank
Image Classification              CIFAR100           Accuracy: 81.98        301
Image Classification              CIFAR10            Accuracy (%): 96.53    282
Image Classification              ImageNet-1k (val)  Top-1 Accuracy: 82.96  18
Event-based Image Classification  CIFAR10-DVS        Accuracy: 84           8
Gesture Recognition               DVS128             Accuracy: 98.8         7
