Speech-XL: Towards Long-Form Speech Understanding in Large Speech Language Models

About

Despite the growing success of Large Speech Language Models (LSLMs) in processing short-term acoustic signals, their extension to long-form audio understanding remains severely bottlenecked by limited context length and the exorbitant memory footprint of long-form inference. In this work, we propose Speech-XL, a new model that capitalizes on the intrinsic key-value (KV) sparsification capacity of Large Language Models (LLMs) to achieve high-ratio speech input compression. Specifically, we introduce a novel special token, the Speech Summarization Token (SST), for each speech interval to encapsulate the intra-interval speech information into its associated KV pairs. The SST module is trained via instruction fine-tuning with a curriculum learning strategy in which the SST learns to compress information progressively, advancing from low-ratio (simple) to high-ratio (challenging) compression. Despite using significantly less training data than other baselines, our model achieves highly competitive performance on major benchmarks, including LongSpeech and AUDIOMARATHON. By addressing these long-standing bottlenecks in long-form audio modeling, our approach offers a novel perspective on the condensation of extensive acoustic sequences.
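
The token bookkeeping implied by the abstract can be illustrated with a minimal sketch. It only shows where one SST per speech interval would sit in the sequence and how a low-to-high compression curriculum might be enumerated; it does not model attention, KV caching, or training. The placeholder token id `SST_ID`, the helper names, the fixed interval size, and the doubling schedule are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple

SST_ID = -1  # hypothetical placeholder id for the Speech Summarization Token


def insert_sst(speech_tokens: List[int], interval: int) -> Tuple[List[int], List[int]]:
    """Append one SST after every `interval` speech tokens (assumed fixed size).

    Returns the new sequence and the positions of the SSTs within it.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    out: List[int] = []
    sst_positions: List[int] = []
    for start in range(0, len(speech_tokens), interval):
        out.extend(speech_tokens[start:start + interval])
        sst_positions.append(len(out))  # index the SST will occupy
        out.append(SST_ID)
    return out, sst_positions


def compression_ratio(num_speech_tokens: int, num_sst: int) -> float:
    """Ratio of speech positions to retained SST positions,
    assuming only the SST KV pairs are kept after compression."""
    return num_speech_tokens / num_sst


def curriculum_ratios(start: int, end: int) -> List[int]:
    """Hypothetical doubling schedule from a low ratio up to `end`."""
    ratios: List[int] = []
    r = start
    while r < end:
        ratios.append(r)
        r *= 2
    ratios.append(end)
    return ratios


tokens = list(range(8))
sequence, positions = insert_sst(tokens, interval=4)
print(sequence)
print(positions)
print(compression_ratio(len(tokens), len(positions)))
print(curriculum_ratios(2, 16))
```

Under these assumptions, a larger interval means fewer SSTs for the same audio, and therefore a higher compression ratio; a curriculum would step through the schedule from the smallest ratio to the largest.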

Haoqin Sun, Chenyang Lyu, Shiwan Zhao, Xuanfan Ni, Xiangyu Kong, Longyue Wang, Weihua Luo, Yong Qin • 2026

Related benchmarks

Task	Dataset	Result
Long-Form Speech Understanding	AudioMarathon 1.0 (test)	Average Score 54.7	16
Audio Classification	AudioMarathon 1.0 (test)	SED Score 53.4	15
Speaker Information Modeling	AudioMarathon 1.0 (test)	SD (Score) 33.3	15
Speech Content Extraction	AudioMarathon 1.0 (test)	SER 31.1	15
Speech Recognition	LongSpeech	WER 11.4	8
Content Separation	LongSpeech	N.A Score 72.84	5
Emotion Analysis	LongSpeech	St.A 51.94	5
Speaker Count	LongSpeech	Speaker Count Metric (N.A.) 84.72	5
Summary	LongSpeech	ROUGE-1 49.72	5
Temporal Issue Localization	LongSpeech	St.A 16.09	5
