
Speech-XL: Towards Long-Form Speech Understanding in Large Speech Language Models

About

Despite the growing success of Large Speech Language Models (LSLMs) in processing short-term acoustic signals, their extension to long-form audio understanding is severely bottlenecked. This limitation stems from the limited context length and the exorbitant memory footprint required for long-form inference. In this work, we propose Speech-XL, a new model that capitalizes on the intrinsic key-value (KV) sparsification capacity of Large Language Models (LLMs) to achieve high-ratio speech input compression. Specifically, we introduce a novel special token, the Speech Summarization Token (SST), for each speech interval to encapsulate the intra-interval speech information into its associated KV pairs. The SST module is trained via instruction fine-tuning, employing a curriculum learning strategy in which the SST learns to compress information progressively, advancing from low-ratio (simple) to high-ratio (challenging) compression. Despite utilizing significantly less training data than other baselines, our model achieves highly competitive performance on major benchmarks, including LongSpeech and AudioMarathon. By addressing the long-standing bottlenecks in long-form audio modeling, our approach offers a novel perspective on the condensation of extensive acoustic sequences.
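
The bookkeeping behind the SST idea can be illustrated with a small, hedged sketch. It is not the authors' implementation: the token string `<SST>`, the fixed interval length, the rule that only SST positions keep their KV pairs, and the doubling curriculum schedule are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List, Tuple

# Hypothetical surface form; the abstract names the token but not its string.
SST = "<SST>"


def insert_ssts(speech_tokens: List[str], interval: int) -> Tuple[List[str], List[int]]:
    """Append one SST after every `interval` speech tokens.

    Returns the new sequence and the positions of the SSTs within it.
    The fixed interval length is an assumption of this sketch.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    sequence: List[str] = []
    sst_positions: List[int] = []
    for start in range(0, len(speech_tokens), interval):
        sequence.extend(speech_tokens[start:start + interval])
        sst_positions.append(len(sequence))  # index where the SST will sit
        sequence.append(SST)
    return sequence, sst_positions


def compression_ratio(num_speech_tokens: int, num_ssts: int) -> float:
    """Speech tokens per retained KV pair, assuming only SST KV pairs are kept."""
    return num_speech_tokens / num_ssts


def curriculum_ratios(start: int, end: int) -> List[int]:
    """Hypothetical low-to-high schedule: double the ratio until `end` is reached."""
    ratios: List[int] = []
    ratio = start
    while ratio < end:
        ratios.append(ratio)
        ratio *= 2
    ratios.append(end)
    return ratios


tokens = [f"s{i}" for i in range(8)]
sequence, sst_positions = insert_ssts(tokens, interval=4)
ratio = compression_ratio(len(tokens), len(sst_positions))
schedule = curriculum_ratios(2, 16)
```

In this toy run, eight speech tokens with an interval of four yield two SSTs, so two KV pairs would stand in for eight tokens under the stated assumption. The schedule simply shows what "low-ratio to high-ratio" could look like; the actual ratios and stage boundaries used in training are not given in the abstract.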

Haoqin Sun, Chenyang Lyu, Shiwan Zhao, Xuanfan Ni, Xiangyu Kong, Longyue Wang, Weihua Luo, Yong Qin • 2026

Related benchmarks

Task                            Dataset                   Metric                       Result  Rank
Long-Form Speech Understanding  AudioMarathon 1.0 (test)  Average Score                54.7    16
Audio Classification            AudioMarathon 1.0 (test)  SED Score                    53.4    15
Speaker Information Modeling    AudioMarathon 1.0 (test)  SD (Score)                   33.3    15
Speech Content Extraction       AudioMarathon 1.0 (test)  SER                          31.1    15
Speech Recognition              LongSpeech                WER                          11.4    8
Content Separation              LongSpeech                N.A Score                    72.84   5
Emotion Analysis                LongSpeech                St.A                         51.94   5
Speaker Count                   LongSpeech                Speaker Count Metric (N.A.)  84.72   5
Summary                         LongSpeech                ROUGE-1                      49.72   5
Temporal Issue Localization     LongSpeech                St.A                         16.09   5