Spiking Wavelet Transformer
About
Spiking neural networks (SNNs) offer an energy-efficient alternative to conventional deep learning by emulating the event-driven processing of the brain. Combining Transformers with SNNs has shown promise for accuracy. However, these Spiking Transformers struggle to learn high-frequency patterns, such as moving edges and pixel-level brightness changes, because they rely on the global self-attention mechanism. Learning these high-frequency representations is challenging but essential for SNN-based event-driven vision. To address this issue, we propose the Spiking Wavelet Transformer (SWformer), an attention-free architecture that effectively learns comprehensive spatial-frequency features in a spike-driven manner by leveraging the sparse wavelet transform. The critical component is a Frequency-Aware Token Mixer (FATM) with three branches: 1) a spiking wavelet learner for spatial-frequency domain learning, 2) a convolution-based learner for spatial feature extraction, and 3) a spiking pointwise convolution for cross-channel information aggregation. Negative spike dynamics are incorporated in the spiking wavelet learner to enhance frequency representation. The FATM enables the SWformer to outperform vanilla Spiking Transformers in capturing high-frequency visual components, as evidenced by our empirical results. Experiments on both static and neuromorphic datasets demonstrate SWformer's effectiveness in capturing spatial-frequency patterns in a multiplication-free and event-driven fashion, outperforming state-of-the-art SNNs. Compared to vanilla Spiking Transformers on the ImageNet dataset, SWformer achieves a 22.03% reduction in parameter count and a 2.52% performance improvement. The code is available at: https://github.com/bic-L/Spiking-Wavelet-Transformer.
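As a rough illustration of why a wavelet transform can fit a multiplication-free, spike-driven setting, the sketch below applies one level of an unnormalized 2D Haar transform to a binary spike map using only additions and subtractions. The Haar basis, the function names `haar2d_level1` and `ternary_spike`, and the firing threshold are assumptions made for this example; they are not taken from the SWformer code, which may use a different formulation.

```python
import numpy as np

def haar2d_level1(x):
    """One level of an unnormalized 2D Haar transform (add/subtract only).

    x: 2D integer array with even height and width.
    Returns four sub-bands, each half the size of x in both dimensions.
    """
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = a + b + c + d  # block sum: low-frequency content
    lh = a + b - c - d  # top minus bottom: responds to horizontal edges
    hl = a - b + c - d  # left minus right: responds to vertical edges
    hh = a - b - c + d  # diagonal detail
    return ll, lh, hl, hh

def ternary_spike(x, threshold=1):
    """Hypothetical firing rule: +1 at or above threshold, -1 at or below -threshold."""
    return (x >= threshold).astype(np.int8) - (x <= -threshold).astype(np.int8)

# Binary spike map with a vertical edge: column 0 is silent, columns 1-3 fire.
spikes = np.zeros((4, 4), dtype=np.int8)
spikes[:, 1:] = 1

ll, lh, hl, hh = haar2d_level1(spikes)
hl_spikes = ternary_spike(hl)

print(hl)         # the edge shows up as negative coefficients in the first column
print(hl_spikes)  # a ternary code keeps the sign that a binary spike would lose
```

The high-frequency coefficients here are signed, so a purely binary (0/1) spike code would discard the negative half of the response. This is one way to read the motivation for the negative spike dynamics mentioned in the abstract.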
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Skeleton-based Action Recognition | NTU RGB+D (Cross-View) | Accuracy | 81.2 | 213 |
| Skeleton-based Action Recognition | NTU RGB+D 120 Cross-Subject | Top-1 Accuracy | 63.5 | 143 |
| Skeleton-based Action Recognition | NTU-RGB+D 120 (Cross-setup) | Accuracy | 64.7 | 136 |
| Skeleton-based Action Recognition | NTU RGB+D (Cross-subject) | Accuracy | 74.7 | 123 |
| Skeleton-based Action Recognition | NW-UCLA | Accuracy | 86.7 | 44 |
| Image Classification | CIFAR10 standard (test) | Top-1 Accuracy | 95.31 | 35 |
| Image Classification | CIFAR100 standard (test) | Top-1 Accuracy | 76.99 | 13 |