FastSLM: Hierarchical Frame Q-Former for Effective Speech Modality Adaptation
About
Although Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision, language, and video understanding tasks, scaling them to long-form speech remains a critical bottleneck due to the explosive growth of input tokens. Existing speech-language models typically project high-frame-rate acoustic features directly into the LLM input space, rendering long-context processing computationally prohibitive as audio duration increases. In this paper, we present FastSLM, a token-efficient architecture designed to overcome this scalability limit through extreme temporal compression. At its core is the Hierarchical Frame Querying Transformer (HFQ-Former), which progressively distills local acoustic details into compact, semantically rich representations across multiple temporal scales. This hierarchical abstraction reduces the speech representation rate to just 1.67 tokens per second, achieving a 93 percent reduction in tokens compared to standard frame-level adapters, while preserving the critical context required for complex reasoning. Experimental results demonstrate that FastSLM achieves competitive performance with state-of-the-art models on long-form benchmarks, despite operating with significantly lower FLOPs and parameter counts. Our findings establish that extreme token compression is a viable pathway to making real-time, long-context speech understanding feasible for LLMs, even under strict computational constraints. The source code and model checkpoints are available at https://anonymous.4open.science/r/FastSLM-8BD3
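The token arithmetic behind the claimed reduction can be sketched as follows. The 1.67 tokens/s rate is taken from the abstract; the 25 tokens/s baseline is an assumption about a typical frame-level speech adapter (it is the rate implied by a 93% reduction, not a figure stated in the paper):

```python
# Token-budget arithmetic implied by the abstract.
# BASELINE_RATE is an assumed frame-level adapter rate (not from the paper);
# FASTSLM_RATE is the 1.67 tokens/s reported for FastSLM.
BASELINE_RATE = 25.0   # tokens per second (assumed)
FASTSLM_RATE = 1.67    # tokens per second (reported)

def token_counts(duration_s: float) -> tuple[int, int]:
    """Return (baseline_tokens, fastslm_tokens) for an audio clip."""
    return round(duration_s * BASELINE_RATE), round(duration_s * FASTSLM_RATE)

for minutes in (1, 10, 60):
    base, fast = token_counts(minutes * 60)
    reduction = 100 * (1 - fast / base)
    print(f"{minutes:>3} min audio: {base:>6} vs {fast:>5} tokens "
          f"({reduction:.1f}% reduction)")
```

Under this assumption, an hour of audio drops from roughly 90,000 input tokens to about 6,000, which is what moves long-form speech back inside a standard LLM context window.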
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | OpenASR | WER (AMI) | 10.8 | 6 |
| Automatic Speech Recognition (En) | OpenASR (test) | WER | 6.47 | 6 |
| Automatic Speech Recognition | Fleurs | En WER | 5.26 | 6 |
| Automatic Speech Recognition | Common Voice 15 | English WER | 10.9 | 6 |
| Spoken Question Answering (En) | LibriSQA | Accuracy | 69.5 | 5 |
| Speech Summarization (En) | SDS-PART6 | Subjective Score (1-7) | 5.4 | 5 |
| Automatic Speech Translation (Ko2En) | Fleurs | BLEU | 19.5 | 4 |
| Automatic Speech Translation (Ko2En) | Minds14 | BLEU | 28.9 | 4 |
| Automatic Speech Recognition (Ko) | Fleurs Common Voice 15 | CER | 3.82 | 3 |
| Automatic Speech Translation (En2Ko) | Fleurs | BLEU | 7.39 | 3 |