SpikeBERT: A Language Spikformer Learned from BERT with Knowledge Distillation
About
Spiking neural networks (SNNs) offer a promising avenue for implementing deep neural networks in a more energy-efficient way. However, existing SNN architectures for language tasks remain simplistic and relatively shallow, and deep architectures have not been fully explored, leaving a significant performance gap compared with mainstream transformer-based networks such as BERT. To this end, we improve a recently proposed spiking Transformer (i.e., Spikformer) so that it can process language tasks, and we propose a two-stage knowledge distillation method for training it: first, pre-training by distilling knowledge from BERT on a large collection of unlabelled texts; then, fine-tuning on task-specific instances by distilling again from a BERT fine-tuned on the same training examples. Through extensive experimentation, we show that the models trained with our method, named SpikeBERT, outperform state-of-the-art SNNs and even achieve results comparable to BERT on both English and Chinese text classification tasks, with much less energy consumption. Our code is available at https://github.com/Lvchangze/SpikeBERT.
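Both training stages rely on a distillation objective that pushes the student's output distribution toward the teacher's. Below is a minimal NumPy sketch of a standard temperature-scaled distillation loss (in the style of Hinton et al.), with hypothetical toy logits; it is an illustration of the general technique, not the actual SpikeBERT training code, which also involves spiking dynamics and feature alignment:

```python
import numpy as np

def softmax(x, T=1.0):
    # Temperature-scaled softmax over the last axis (numerically stable).
    z = x / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) between temperature-softened distributions,
    # scaled by T^2 so gradients keep a comparable magnitude across T.
    p = softmax(teacher_logits, T)   # soft targets from the teacher (e.g., BERT)
    q = softmax(student_logits, T)   # predictions from the student (e.g., a spiking model)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T ** 2)

# Toy example: logits for 2 instances, 3 classes (values are illustrative only).
teacher = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.0]])
student = np.array([[1.8, 0.4, -0.9], [0.0, 1.2, 0.1]])
loss = distillation_loss(student, teacher)
```

In practice this term is combined with the ordinary cross-entropy on hard labels during the fine-tuning stage, weighted by a mixing coefficient.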
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text Classification | SST-2 (test) | Accuracy | 81.71 | 185 |
| Classification | CIFAR10-DVS | Accuracy | 76.4 | 133 |
| Subjectivity Classification | Subj (test) | Accuracy | 91.6 | 125 |
| Text Classification | MR (test) | Accuracy | 75.87 | 99 |
| Text Classification | SST-5 (test) | Accuracy | 41.84 | 58 |
| Time Series Forecasting | METR-LA | -- | -- | 39 |
| Time Series Forecasting | solar | R2 (6h) | 0.929 | 19 |
| Time Series Forecasting | PEMS-BAY | R2 (Horizon 6) | 0.768 | 19 |
| Time Series Forecasting | Electricity | R2 (Horizon 6) | 0.964 | 12 |
| Text Classification | ChnSenti (test) | Accuracy | 85.62 | 5 |