MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion

About

The combination of Spiking Neural Networks (SNNs) with Vision Transformer architectures has garnered significant attention due to its potential for energy-efficient, high-performance computing. However, a substantial performance gap still exists between SNN-based and ANN-based transformer architectures. Existing methods propose spiking self-attention mechanisms that combine successfully with SNNs, but the overall architectures they propose suffer from a bottleneck in extracting features at different image scales. In this paper, we address this issue and propose MSVIT, a novel spike-driven Transformer architecture that is the first to use multi-scale spiking attention (MSSA) to enhance the capabilities of spiking attention blocks. We validate our approach across several mainstream datasets. The experimental results show that MSVIT outperforms existing SNN-based models, positioning it as a state-of-the-art solution among SNN-transformer architectures. The code is available at https://github.com/Nanhu-AI-Lab/MSViT.

Wei Hua, Chenlin Zhou, Jibin Wu, Yansong Chua, Yangyang Shu • 2025

Related benchmarks

Task	Dataset	Metric	Value
Image Classification	CIFAR100	Accuracy	81.98	301
Image Classification	CIFAR10	Accuracy (%)	96.53	282
Image Classification	ImageNet-1k (val)	Top-1 Accuracy	82.96	18
Event-based Image Classification	CIFAR10-DVS	Accuracy	84	8
Gesture Recognition	DVS128	Accuracy	98.8	7

