Speech-XL: Towards Long-Form Speech Understanding in Large Speech Language Models
About
Despite the growing success of Large Speech Language Models (LSLMs) in processing short-term acoustic signals, extending them to long-form audio understanding remains severely bottlenecked by limited context length and the exorbitant memory footprint of long-form inference. In this work, we propose Speech-XL, a new model that capitalizes on the intrinsic key-value (KV) sparsification capacity of Large Language Models (LLMs) to achieve high-ratio speech input compression. Specifically, we introduce a novel special token, the Speech Summarization Token (SST), for each speech interval to encapsulate the intra-interval speech information into its associated KV pairs. The SST module is trained via instruction fine-tuning with a curriculum learning strategy in which the SST learns to compress information progressively, advancing from low-ratio (simple) to high-ratio (challenging) compression. Despite using significantly less training data than other baselines, our model achieves highly competitive performance on major benchmarks, including LongSpeech and AUDIOMARATHON. By addressing the long-standing bottlenecks in long-form audio modeling, our approach offers a novel perspective on the condensation of extensive acoustic sequences.
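The abstract gives no implementation details, so the following is only a minimal sketch of the bookkeeping the SST idea implies, not the authors' code. It assumes one SST is appended after each fixed-length interval of speech frames, that only the SST positions keep their KV pairs afterwards, and that the curriculum doubles the compression ratio at each stage. The names `insert_ssts`, `retained_positions`, and `curriculum_ratios`, the `<SST>` string, and the doubling schedule are all hypothetical.

```python
import math

SST = "<SST>"  # hypothetical placeholder for the Speech Summarization Token


def insert_ssts(frames, interval):
    """Append one SST after every `interval` frames (the last interval may be shorter)."""
    sequence = []
    for start in range(0, len(frames), interval):
        sequence.extend(frames[start:start + interval])
        sequence.append(SST)
    return sequence


def retained_positions(sequence):
    """Indices whose KV pairs would be kept if only SST entries are retained (assumption)."""
    return [i for i, token in enumerate(sequence) if token == SST]


def curriculum_ratios(start=2, stop=16):
    """Assumed low-to-high schedule: double the compression ratio until `stop`."""
    ratios = []
    ratio = start
    while ratio <= stop:
        ratios.append(ratio)
        ratio *= 2
    return ratios


# Example: 10 speech frames with an interval (compression ratio) of 4.
frames = list(range(10))
sequence = insert_ssts(frames, interval=4)
kept = retained_positions(sequence)
# The number of retained KV slots equals the number of intervals, ceil(10 / 4).
assert len(kept) == math.ceil(len(frames) / 4)
```

Under these assumptions the retained cache grows with the number of intervals rather than the number of speech frames, which is the source of the memory savings the abstract describes.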
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Long-Form Speech Understanding | AudioMarathon 1.0 (test) | Average Score | 54.7 | 16 |
| Audio Classification | AudioMarathon 1.0 (test) | SED Score | 53.4 | 15 |
| Speaker Information Modeling | AudioMarathon 1.0 (test) | SD (Score) | 33.3 | 15 |
| Speech Content Extraction | AudioMarathon 1.0 (test) | SER | 31.1 | 15 |
| Speech Recognition | LongSpeech | WER | 11.4 | 8 |
| Content Separation | LongSpeech | N.A Score | 72.84 | 5 |
| Emotion Analysis | LongSpeech | St.A | 51.94 | 5 |
| Speaker Count | LongSpeech | Speaker Count Metric (N.A.) | 84.72 | 5 |
| Summary | LongSpeech | ROUGE-1 | 49.72 | 5 |
| Temporal Issue Localization | LongSpeech | St.A | 16.09 | 5 |