Spiking Transformer with Spatial-Temporal Attention

About

Spike-based Transformer presents a compelling and energy-efficient alternative to traditional Artificial Neural Network (ANN)-based Transformers, achieving impressive results through sparse binary computations. However, existing spike-based transformers predominantly focus on spatial attention while neglecting crucial temporal dependencies inherent in spike-based processing, leading to suboptimal feature representation and limited performance. To address this limitation, we propose Spiking Transformer with Spatial-Temporal Attention (STAtten), a simple and straightforward architecture that efficiently integrates both spatial and temporal information in the self-attention mechanism. STAtten introduces a block-wise computation strategy that processes information in spatial-temporal chunks, enabling comprehensive feature capture while maintaining the same computational complexity as previous spatial-only approaches. Our method can be seamlessly integrated into existing spike-based transformers without architectural overhaul. Extensive experiments demonstrate that STAtten significantly improves the performance of existing spike-based transformers across both static and neuromorphic datasets, including CIFAR10/100, ImageNet, CIFAR10-DVS, and N-Caltech101. The code is available at https://github.com/Intelligent-Computing-Lab-Yale/STAtten

Donghyun Lee, Yuhang Li, Youngeun Kim, Shiting Xiao, Priyadarshini Panda• 2024

Related benchmarks

Task	Dataset	Result
Image Classification	CIFAR-10 (test)	Accuracy96.03	882
Image Classification	CIFAR-10	--	875
Action Recognition	NTU RGB+D 120 (X-set)	Accuracy61.7	770
Action Recognition	NTU RGB+D (Cross-View)	Accuracy79.7	652
Action Recognition	NTU RGB+D (Cross-subject)	Accuracy72.8	500
Image Classification	CIFAR-100 (test)	Top-1 Accuracy80.2	395
Image Classification	CIFAR-100	Accuracy79.85	357
Action Recognition	NTU RGB+D 120 Cross-Subject	Accuracy60.3	241
Skeleton-based Action Recognition	NTU RGB+D (Cross-View)	Accuracy79.7	213
Skeleton-based Action Recognition	NTU RGB+D 120 Cross-Subject	Top-1 Accuracy60.3	143

Showing 10 of 23 rows

Other info

Follow for update

@wizwand_team Discord