Polynomial Mixing for Efficient Self-supervised Speech Encoders
About
State-of-the-art speech-to-text models typically employ Transformer-based encoders that model token dependencies via self-attention mechanisms. However, the quadratic complexity of self-attention in both memory and computation imposes significant constraints on scalability. In this work, we propose a novel token-mixing mechanism, the Polynomial Mixer (PoM), as a drop-in replacement for multi-head self-attention. PoM computes a polynomial representation of the input with complexity linear in the input sequence length. We integrate PoM into a self-supervised speech representation learning framework based on BEST-RQ and evaluate its performance on downstream speech recognition tasks. Experimental results demonstrate that PoM achieves word error rates competitive with full self-attention and other linear-complexity alternatives, offering an improved trade-off between recognition performance and efficiency in both time and memory.
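The key idea can be pictured with a short sketch. The PyTorch module below is a minimal illustration of linear-complexity polynomial mixing, not the paper's exact PoM formulation: the `PolynomialMixer` name, the use of degree-1 and degree-2 monomial statistics, and the `hidden` width are all illustrative assumptions. Each token is projected, low-order monomial statistics are pooled once over the whole sequence, and every token then reads the shared pooled state back, so no pairwise token-to-token scores are ever formed.

```python
import torch
import torch.nn as nn

class PolynomialMixer(nn.Module):
    """Illustrative token mixer with cost linear in sequence length.

    Hypothetical sketch: project tokens, pool degree-1 and degree-2
    monomial statistics over the sequence, and let each token decode
    from the shared pooled state. This stands in for the attention
    block; the exact expansion in the paper may differ.
    """

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.proj = nn.Linear(dim, hidden)
        # Pooled state concatenates degree-1 and degree-2 statistics.
        self.out = nn.Linear(dim + 2 * hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        h = self.proj(x)                          # (B, T, hidden)
        m1 = h.mean(dim=1, keepdim=True)          # degree-1 statistic, O(T)
        m2 = (h * h).mean(dim=1, keepdim=True)    # degree-2 statistic, O(T)
        state = torch.cat([m1, m2], dim=-1)       # (B, 1, 2*hidden)
        state = state.expand(-1, x.size(1), -1)   # broadcast to every token
        return self.out(torch.cat([x, state], dim=-1))

# Usage: mixing a batch of long feature sequences.
mixer = PolynomialMixer(dim=256)
x = torch.randn(2, 1000, 256)   # 1000 frames per sequence
y = mixer(x)                    # (2, 1000, 256), no T x T attention matrix
```

Because the only cross-token operation is the pooling of fixed-size statistics, memory and compute grow linearly with the number of frames, which is where the efficiency gain over quadratic self-attention comes from in this sketch.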
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech clean (test) | WER 4.52 | 1156 |
| Speech Recognition | LibriSpeech other (test) | WER 11.33 | 105 |