TokenSE: a Mamba-based discrete token speech enhancement framework for cochlear implants
About
Speech enhancement (SE) is critical for improving speech intelligibility and quality in real-world environments, particularly for cochlear implant (CI) users, who experience severe degradations in speech understanding under noisy and reverberant conditions. In this study, we propose TokenSE, a discrete token-based SE framework operating in the neural audio codec space, which predicts clean codec token indices from degraded speech using a Mamba-based model. Unlike the Transformer, whose self-attention mechanism has a computational complexity that grows quadratically with sequence length, Mamba uses an input-dependent selection mechanism whose cost grows linearly with sequence length, making it a compelling alternative to Transformers, especially for CI and hearing-aid (HA) applications. Objective evaluations show that TokenSE consistently outperforms baseline methods on both in-domain and out-of-domain datasets. Moreover, subjective listening experiments with CI users indicate a clear benefit in speech intelligibility under adverse noisy and reverberant conditions.
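The complexity contrast in the abstract can be illustrated with a toy sketch. This is not the TokenSE model or Mamba's actual parameterization; it is a scalar, input-gated recurrence written only to show why a selective scan costs one step per frame while full self-attention scores every pair of frames. The names `attention_pair_count`, `selective_scan`, and `sigmoid_gate` are hypothetical and chosen for this example.

```python
import math


def attention_pair_count(seq_len: int) -> int:
    """Number of query-key scores full self-attention computes: seq_len ** 2."""
    return seq_len * seq_len


def sigmoid_gate(x: float) -> float:
    """Assumed gate for illustration: maps an input value to a retention factor in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def selective_scan(inputs, gate_fn=sigmoid_gate):
    """Toy input-dependent recurrence: h_t = a_t * h_{t-1} + (1 - a_t) * x_t.

    The gate a_t depends on the current input x_t, which is the sense in which
    the recurrence is "selective". One update is performed per input, so the
    cost is linear in sequence length.
    """
    state = 0.0
    outputs = []
    steps = 0
    for x in inputs:
        a = gate_fn(x)                      # input-dependent retention factor
        state = a * state + (1.0 - a) * x   # single O(1) state update
        outputs.append(state)
        steps += 1
    return outputs, steps


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        _, steps = selective_scan([0.1] * n)
        print(f"frames={n}: scan steps={steps}, attention scores={attention_pair_count(n)}")
```

For a streaming device such as a CI or HA processor, the point of interest is that the recurrence carries a fixed-size state from frame to frame, whereas the attention score count grows with the square of the number of frames.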
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Speech Enhancement | DNS Challenge Real Recordings (test) | SIG Score | 3.49 | 32 |
| Speech Enhancement | DNS Challenge With Reverb (test) | SIG | 3.643 | 24 |
| Speech Enhancement | DNS Challenge Without Reverb (test) | -- | -- | 14 |
| Speech Enhancement | TIMIT OOD, with-reverberation, T60 = 0.7s, 5 dB SNR | SIG Score | 3.454 | 3 |
| Speech Enhancement | TIMIT noisy-only, 0 dB SNR (OOD) | SIG Score | 3.514 | 3 |
| Speech Enhancement | TIMIT noisy-only, 5 dB SNR (OOD) | SIG Score | 3.486 | 3 |
| Speech Enhancement | TIMIT OOD, with-reverberation, T60 = 0.5s, 5 dB SNR | SIG Score | 3.505 | 3 |