Spiking Wavelet Transformer

About

Spiking neural networks (SNNs) offer an energy-efficient alternative to conventional deep learning by emulating the brain's event-driven processing. Combining Transformers with SNNs has shown promise for accuracy. However, such models struggle to learn high-frequency patterns, such as moving edges and pixel-level brightness changes, because they rely on global self-attention. Learning these high-frequency representations is challenging but essential for SNN-based event-driven vision. To address this issue, we propose the Spiking Wavelet Transformer (SWformer), an attention-free architecture that learns comprehensive spatial-frequency features in a spike-driven manner by leveraging the sparse wavelet transform. The critical component is a Frequency-Aware Token Mixer (FATM) with three branches: 1) a spiking wavelet learner for spatial-frequency domain learning, 2) a convolution-based learner for spatial feature extraction, and 3) a spiking pointwise convolution for cross-channel information aggregation. Negative spike dynamics are incorporated in the first branch to enhance frequency representation. Our empirical results show that the FATM enables the SWformer to outperform vanilla Spiking Transformers in capturing high-frequency visual components. Experiments on both static and neuromorphic datasets demonstrate SWformer's effectiveness in capturing spatial-frequency patterns in a multiplication-free and event-driven fashion, outperforming state-of-the-art SNNs. Compared to vanilla Spiking Transformers, SWformer achieves a 22.03% reduction in parameter count and a 2.52% performance improvement on the ImageNet dataset. The code is available at: https://github.com/bic-L/Spiking-Wavelet-Transformer.
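To make the idea of multiplication-free spatial-frequency features concrete, here is a minimal sketch of a single-level, unnormalized 2D Haar transform applied to a binary spike frame. The choice of the Haar wavelet, the function name `haar2d_level1`, and the sub-band names are assumptions made for illustration; this is not the authors' implementation, which is in the repository linked above.

```python
from typing import List, Tuple

Grid = List[List[int]]


def haar2d_level1(x: Grid) -> Tuple[Grid, Grid, Grid, Grid]:
    """One level of an unnormalized 2D Haar transform (hypothetical helper).

    Uses only additions and subtractions on each 2x2 block.
    Returns (approx, horiz_diff, vert_diff, diag_diff), each half the
    height and width of the input.
    """
    h, w = len(x), len(x[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    approx: Grid = []
    horiz_diff: Grid = []
    vert_diff: Grid = []
    diag_diff: Grid = []
    for i in range(0, h, 2):
        row_a, row_h, row_v, row_d = [], [], [], []
        for j in range(0, w, 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            row_a.append(a + b + c + d)  # local sum: low-frequency content
            row_h.append(a - b + c - d)  # left minus right: responds to vertical edges
            row_v.append(a + b - c - d)  # top minus bottom: responds to horizontal edges
            row_d.append(a - b - c + d)  # diagonal difference
        approx.append(row_a)
        horiz_diff.append(row_h)
        vert_diff.append(row_v)
        diag_diff.append(row_d)
    return approx, horiz_diff, vert_diff, diag_diff


# A 4x4 binary spike frame with a vertical edge between columns 0 and 1.
frame = [[0, 1, 1, 1] for _ in range(4)]
approx, horiz_diff, vert_diff, diag_diff = haar2d_level1(frame)
print(approx)      # [[2, 4], [2, 4]]
print(horiz_diff)  # [[-2, 0], [-2, 0]]
```

The edge shows up only in the horizontal-difference sub-band, and the coefficient there is negative even though the input is binary. That may be related to why the abstract mentions negative spike dynamics in the wavelet branch, but this connection is our reading rather than something stated in the source.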

Yuetong Fang, Ziqing Wang, Lingfeng Zhang, Jiahang Cao, Honglei Chen, Renjing Xu • 2024

Related benchmarks

Task	Dataset	Result
Image Classification	CIFAR-10 (test)	Accuracy: 96.1	882
Image Classification	CIFAR-10	--	875
Action Recognition	NTU RGB+D 120 (X-set)	Accuracy: 64.7	770
Action Recognition	NTU RGB+D (Cross-View)	Accuracy: 81.2	652
Action Recognition	NTU RGB+D (Cross-subject)	Accuracy: 74.7	500
Image Classification	CIFAR-100 (test)	Top-1 Accuracy: 79.3	395
Image Classification	CIFAR-100	--	357
Action Recognition	NTU RGB+D 120 Cross-Subject	Accuracy: 63.5	241
Skeleton-based Action Recognition	NTU RGB+D (Cross-View)	Accuracy: 81.2	213
Skeleton-based Action Recognition	NTU RGB+D 120 Cross-Subject	Top-1 Accuracy: 63.5	143

Showing 10 of 19 rows
