
FastSLM: Hierarchical Frame Q-Former for Effective Speech Modality Adaptation

About

Although Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision, language, and video understanding tasks, scaling them to long-form speech remains a critical bottleneck due to the explosive growth of input tokens. Existing speech-language models typically project high-frame-rate acoustic features directly into the LLM input space, rendering long-context processing computationally prohibitive as audio duration increases. In this paper, we present FastSLM, a token-efficient architecture designed to overcome this scalability limit through extreme temporal compression. At its core is the Hierarchical Frame Querying Transformer (HFQ-Former), which progressively distills local acoustic details into compact, semantically rich representations across multiple temporal scales. This hierarchical abstraction reduces the speech representation rate to just 1.67 tokens per second, achieving a 93 percent reduction in tokens compared to standard frame-level adapters, while preserving the critical context required for complex reasoning. Experimental results demonstrate that FastSLM achieves competitive performance with state-of-the-art models on long-form benchmarks, despite operating with significantly lower FLOPs and parameter counts. Our findings establish that extreme token compression is a viable pathway to making real-time, long-context speech understanding feasible for LLMs, even under strict computational constraints. The source code and model checkpoints are available at https://anonymous.4open.science/r/FastSLM-8BD3
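The hierarchical querying idea described above can be sketched numerically. In the toy example below (an illustration, not the authors' implementation), a stack of learnable query banks progressively cross-attends over acoustic frames, shrinking the temporal resolution at each stage; the frame rate, stage sizes, and single-head attention are all assumptions chosen so the final rate lands at the paper's 1.67 tokens per second.

```python
import numpy as np

def cross_attend(queries, frames):
    """Single-head cross-attention: queries (q, d) attend over frames (t, d) -> (q, d)."""
    d = queries.shape[-1]
    scores = queries @ frames.T / np.sqrt(d)        # (q, t) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over frames
    return weights @ frames                         # weighted pooling of frames

def hierarchical_compress(frames, stage_queries):
    """Progressively distill frames through successive query stages."""
    x = frames
    for q in stage_queries:
        x = cross_attend(q, x)
    return x

rng = np.random.default_rng(0)
d = 64
frames = rng.standard_normal((750, d))      # e.g. 30 s of audio at an assumed 25 frames/s
stages = [
    rng.standard_normal((250, d)),          # stage 1: 25 -> ~8.3 tokens/s
    rng.standard_normal((50, d)),           # stage 2: -> ~1.67 tokens/s
]
out = hierarchical_compress(frames, stages)
print(out.shape)                            # (50, 64): 50 tokens for 30 s of speech
reduction = 1 - 50 / 750
print(f"{reduction:.1%}")                   # 93.3% fewer tokens than frame-level input
```

Under these assumed numbers, 50 output tokens for 30 seconds of audio gives exactly the 1.67 tokens/s and the roughly 93 percent reduction quoted in the abstract.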

Junseok Lee, Sangyong Lee, Chang-Jae Chun • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Automatic Speech Recognition | OpenASR | WER (AMI) | 10.8 | 6 |
| Automatic Speech Recognition (En) | OpenASR (test) | WER | 6.47 | 6 |
| Automatic Speech Recognition | Fleurs | En WER | 5.26 | 6 |
| Automatic Speech Recognition | Common Voice 15 | English WER | 10.9 | 6 |
| Spoken Question Answering (En) | LibriSQA | Accuracy | 69.5 | 5 |
| Speech Summarization (En) | SDS-PART6 | Subjective Score (1-7) | 5.4 | 5 |
| Automatic Speech Translation (Ko2En) | Fleurs | BLEU | 19.5 | 4 |
| Automatic Speech Translation (Ko2En) | Minds14 | BLEU | 28.9 | 4 |
| Automatic Speech Recognition (Ko) | Fleurs Common Voice 15 | CER | 3.82 | 3 |
| Automatic Speech Translation (En2Ko) | Fleurs | BLEU | 7.39 | 3 |
Showing 10 of 12 rows
