
Polynomial Mixing for Efficient Self-supervised Speech Encoders

About

State-of-the-art speech-to-text models typically employ Transformer-based encoders that model token dependencies via self-attention mechanisms. However, the quadratic complexity of self-attention in both memory and computation imposes significant constraints on scalability. In this work, we propose a novel token-mixing mechanism, the Polynomial Mixer (PoM), as a drop-in replacement for multi-head self-attention. PoM computes a polynomial representation of the input with linear complexity with respect to the input sequence length. We integrate PoM into a self-supervised speech representation learning framework based on BEST-RQ and evaluate its performance on downstream speech recognition tasks. Experimental results demonstrate that PoM achieves a competitive word error rate compared to full self-attention and other linear-complexity alternatives, offering an improved trade-off between recognition performance and efficiency in both time and memory.
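The abstract does not spell out PoM's exact formulation, but the general pattern of a linear-complexity token mixer can be sketched. Below is a minimal, hypothetical PyTorch sketch, assuming a mixer that pools low-order polynomial features of the tokens into a fixed-size summary and broadcasts that summary back to every position; all names (PolynomialMixerSketch, order_proj, degree) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PolynomialMixerSketch(nn.Module):
    """Hypothetical linear-complexity token mixer.

    Illustrates the general idea in the abstract: instead of pairwise
    attention (O(T^2) in sequence length T), each token contributes
    low-order polynomial features to a fixed-size global summary,
    which every token then reads back. This is a sketch of the generic
    pattern, not the authors' exact PoM formulation.
    """
    def __init__(self, d_model: int, degree: int = 2):
        super().__init__()
        # One linear projection per polynomial order, plus an output projection.
        self.order_proj = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(degree)]
        )
        self.out_proj = nn.Linear(degree * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        summaries = []
        for p, proj in enumerate(self.order_proj, start=1):
            feats = proj(x) ** p                       # per-token polynomial features
            summaries.append(feats.mean(dim=1, keepdim=True))  # O(T) pooling
        # Broadcast the fixed-size summary back to every token position.
        mixed = torch.cat(summaries, dim=-1).expand(-1, x.size(1), -1)
        return x + self.out_proj(mixed)                # residual connection

# Usage: a batch of 2 sequences of 1000 frames with 256-dim features.
x = torch.randn(2, 1000, 256)
y = PolynomialMixerSketch(256)(x)
print(y.shape)  # torch.Size([2, 1000, 256])
```

Under these assumptions, the per-token work is constant and the pooling is a single pass over the sequence, so both time and memory grow linearly in T rather than quadratically as with full self-attention.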

Eva Feillet, Ryan Whetten, David Picard, Alexandre Allauzen • 2026

Related benchmarks

Task                           Dataset                    Result      Rank
Automatic Speech Recognition   LibriSpeech clean (test)   WER 4.52    1156
Speech Recognition             LibriSpeech other (test)   WER 11.33   105
