
Learnable Pulse Accumulation for On-Device Speech Recognition: How Much Attention Do You Need?

About

Self-attention scales quadratically with sequence length, limiting transformer-based speech models on edge devices. We introduce the Learnable Pulse Accumulator (LPA), an O(n) replacement that substitutes key-query dot products with learned gating functions: content-dependent rectangular pulses, periodic windows, and position-dependent basis functions. An MSE diagnostic sweep determines per-layer replacement difficulty and ordering. Replacing 8 of 12 wav2vec2-base layers yields a 10.61% word error rate (WER) on LibriSpeech test-clean, a 7.24-percentage-point (pp) increase over the 3.37% baseline, with a 3.27× speedup on 120 s audio on an Apple M4 Pro via an optimized MLX inference path. Cross-domain validation on SepFormer speech enhancement shows that all 16 intra-chunk attention layers can be replaced without collapse, suggesting the depth wall arises from linguistic computation rather than from an LPA limitation. LPA's near-binary gates at inference enable dense GPU computation with no CPU-GPU synchronization, and all operations map to mobile neural accelerators.
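The abstract does not spell out the exact gating forms, but the core idea — replacing the O(n²) key-query interaction with a learned per-position gate followed by a causal accumulation — can be sketched as follows. This is a minimal illustration under assumptions: the function name `lpa_layer`, the single sigmoid gate standing in for the paper's pulse/window/basis family, and the running-mass normalization are all hypothetical, not the paper's implementation.

```python
import numpy as np

def lpa_layer(x, w_gate, b_gate, w_val, sharpness=10.0):
    """Hypothetical O(n) pulse-accumulation layer (sketch, not the paper's exact form).

    x:       (T, d) sequence of hidden states
    w_gate:  (d,)   learned gate projection (one content-dependent pulse)
    b_gate:  scalar learned gate threshold
    w_val:   (d, d) learned value projection
    """
    v = x @ w_val                                                  # (T, d) values
    # Sharp sigmoid -> near-binary gate in (0, 1); at inference this
    # approximates a rectangular pulse over the time axis.
    g = 1.0 / (1.0 + np.exp(-sharpness * (x @ w_gate - b_gate)))   # (T,)
    gated = v * g[:, None]
    acc = np.cumsum(gated, axis=0)   # causal prefix sum: O(n), no pairwise scores
    norm = np.cumsum(g) + 1e-6       # running gate mass, for normalization
    return acc / norm[:, None]
```

Because the gate and prefix sum are dense elementwise/scan operations, there is no data-dependent control flow — consistent with the abstract's claim that near-binary gates permit dense GPU computation without CPU-GPU synchronization.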

Yakov Pyotr Shkolnikov • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Speech Recognition | LibriSpeech clean (dev) | – | – | 80 |
| Inference Speed | Audio 10s (test) | Inference Time (ms) | 42 | 5 |
| Inference Speed | Audio 30s (test) | Inference Time (ms) | 125 | 5 |
| Inference Speed | Audio 60s (test) | Inference Time (ms) | 244 | 5 |
| Inference Speed | Audio 120s (test) | Inference Time (ms) | 479 | 5 |
