
Learnable Pulse Accumulation for On-Device Speech Recognition: How Much Attention Do You Need?

About

Self-attention scales quadratically with sequence length, limiting transformer-based speech models on edge devices. We introduce the Learnable Pulse Accumulator (LPA), an O(n) replacement that substitutes key-query dot products with learned gating functions: content-dependent rectangular pulses, periodic windows, and position-dependent basis functions. An MSE diagnostic sweep determines per-layer replacement difficulty and ordering. Replacing 8 of 12 wav2vec2-base layers yields a 10.61% word error rate (WER) on LibriSpeech test-clean, a 7.24-percentage-point (pp) increase over the 3.37% baseline, with a 3.27× speedup on 120 s audio on an Apple M4 Pro via an optimized MLX inference path. Cross-domain validation on SepFormer speech enhancement shows that all 16 intra-chunk attention layers can be replaced without collapse, suggesting the depth wall arises from linguistic computation rather than from an LPA limitation. LPA's near-binary gates at inference enable dense GPU computation with no CPU-GPU synchronization, and all operations map to mobile neural accelerators.
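The abstract does not spell out the exact gating forms, but the core idea — replacing the O(n²) key-query interaction with a learned per-position gate followed by a causal accumulation — can be sketched as follows. This is a minimal illustration under assumptions: the function name `lpa_layer`, the single sigmoid gate standing in for the paper's pulse/window/basis family, and the running-mass normalization are all hypothetical, not the paper's implementation.

```python
import numpy as np

def lpa_layer(x, w_gate, b_gate, w_val, sharpness=10.0):
    """Hypothetical O(n) pulse-accumulation layer (sketch, not the paper's exact form).

    x:       (T, d) sequence of hidden states
    w_gate:  (d,)   learned gate projection (one content-dependent pulse)
    b_gate:  scalar learned gate threshold
    w_val:   (d, d) learned value projection
    """
    v = x @ w_val                                                  # (T, d) values
    # Sharp sigmoid -> near-binary gate in (0, 1); at inference this
    # approximates a rectangular pulse over the time axis.
    g = 1.0 / (1.0 + np.exp(-sharpness * (x @ w_gate - b_gate)))   # (T,)
    gated = v * g[:, None]
    acc = np.cumsum(gated, axis=0)   # causal prefix sum: O(n), no pairwise scores
    norm = np.cumsum(g) + 1e-6       # running gate mass, for normalization
    return acc / norm[:, None]
```

Because the gate and prefix sum are dense elementwise/scan operations, there is no data-dependent control flow — consistent with the abstract's claim that near-binary gates permit dense GPU computation without CPU-GPU synchronization.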

Yakov Pyotr Shkolnikov • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Speech Recognition | LibriSpeech clean (dev) | – | – | 80 |
| Inference Speed | Audio 10s (test) | Inference Time (ms) | 42 | 5 |
| Inference Speed | Audio 30s (test) | Inference Time (ms) | 125 | 5 |
| Inference Speed | Audio 60s (test) | Inference Time (ms) | 244 | 5 |
| Inference Speed | Audio 120s (test) | Inference Time (ms) | 479 | 5 |
