FastSLM: Hierarchical Frame Q-Former for Effective Speech Modality Adaptation
About
Although Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision, language, and video understanding tasks, scaling them to long-form speech remains a critical bottleneck due to the explosive growth of input tokens. Existing speech-language models typically project high-frame-rate acoustic features directly into the LLM input space, rendering long-context processing computationally prohibitive as audio duration increases. In this paper, we present FastSLM, a token-efficient architecture designed to overcome this scalability limit through extreme temporal compression. At its core is the Hierarchical Frame Querying Transformer (HFQ-Former), which progressively distills local acoustic details into compact, semantically rich representations across multiple temporal scales. This hierarchical abstraction reduces the speech representation rate to just 1.67 tokens per second, achieving a 93 percent reduction in tokens compared to standard frame-level adapters, while preserving the critical context required for complex reasoning. Experimental results demonstrate that FastSLM achieves competitive performance with state-of-the-art models on long-form benchmarks, despite operating with significantly lower FLOPs and parameter counts. Our findings establish that extreme token compression is a viable pathway to making real-time, long-context speech understanding feasible for LLMs, even under strict computational constraints. The source code and model checkpoints are available at https://anonymous.4open.science/r/FastSLM-8BD3
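The token arithmetic behind the claimed reduction can be sketched as follows. The 1.67 tokens/s rate is taken from the abstract; the 25 tokens/s baseline is an assumption about a typical frame-level speech adapter (it is the rate implied by a 93% reduction, not a figure stated in the paper):

```python
# Token-budget arithmetic implied by the abstract.
# BASELINE_RATE is an assumed frame-level adapter rate (not from the paper);
# FASTSLM_RATE is the 1.67 tokens/s reported for FastSLM.
BASELINE_RATE = 25.0   # tokens per second (assumed)
FASTSLM_RATE = 1.67    # tokens per second (reported)

def token_counts(duration_s: float) -> tuple[int, int]:
    """Return (baseline_tokens, fastslm_tokens) for an audio clip."""
    return round(duration_s * BASELINE_RATE), round(duration_s * FASTSLM_RATE)

for minutes in (1, 10, 60):
    base, fast = token_counts(minutes * 60)
    reduction = 100 * (1 - fast / base)
    print(f"{minutes:>3} min audio: {base:>6} vs {fast:>5} tokens "
          f"({reduction:.1f}% reduction)")
```

Under this assumption, an hour of audio drops from roughly 90,000 input tokens to about 6,000, which is what moves long-form speech back inside a standard LLM context window.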
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | OpenASR | WER (AMI) | 10.8 | 6 |
| Automatic Speech Recognition (En) | OpenASR (test) | WER | 6.47 | 6 |
| Automatic Speech Recognition | Fleurs | En WER | 5.26 | 6 |
| Automatic Speech Recognition | Common Voice 15 | English WER | 10.9 | 6 |
| Spoken Question Answering (En) | LibriSQA | Accuracy | 69.5 | 5 |
| Speech Summarization (En) | SDS-PART6 | Subjective Score (1-7) | 5.4 | 5 |
| Automatic Speech Translation (Ko2En) | Fleurs | BLEU | 19.5 | 4 |
| Automatic Speech Translation (Ko2En) | Minds14 | BLEU | 28.9 | 4 |
| Automatic Speech Recognition (Ko) | Fleurs Common Voice 15 | CER | 3.82 | 3 |
| Automatic Speech Translation (En2Ko) | Fleurs | BLEU | 7.39 | 3 |