
Soundwave: Less is More for Speech-Text Alignment in LLMs

About

Existing end-to-end speech large language models (LLMs) usually rely on large-scale annotated data for training, while data-efficient training has not been discussed in depth. We focus on two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. We propose Soundwave, which utilizes an efficient training strategy and a novel architecture to address these issues. Results show that Soundwave outperforms the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks, using only one-fiftieth of the training data. Further analysis shows that Soundwave still retains its intelligence during conversation. The project is available at https://github.com/FreedomIntelligence/Soundwave.

Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 5 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 2.1 | 833 |
| Vocal Sound Classification | VocalSound | Accuracy | 90.5 | 21 |
| Speech Emotion Recognition | MELD | Accuracy | 63.5 | 19 |
| Sound Foundation | AIR-Bench 1.0 (test) | Score | 62.1 | 13 |
| Chat Benchmark | AIR-Bench | Score (Speech Domain) | 6.41 | 11 |
| Speech Translation | CoVoST2 En-De | BLEU | 30.6 | 10 |
| Speech Foundation | AIR-Bench Speech Foundation | Speech Grounding | 5.92e+3 | 7 |
| Speech Chat | AIR-Bench 1.0 (test) | Overall Score | 6.51 | 7 |
| Music Foundation Tasks | AIR-Bench Music 1.0 (test) | Inst. Classification Acc | 37.1 | 7 |
Showing 10 of 15 rows
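
For readers unfamiliar with the WER metric reported for the LibriSpeech rows, the following is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the Soundwave project, and the sketch assumes the listed WER values are percentages. Published evaluations usually also normalize text (casing, punctuation) before scoring, which this sketch omits.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name), not the paper's evaluation code.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {100 * wer:.1f}%")  # prints "WER = 33.3%"
```

Under this definition, lower is better, and a WER of 2.1 would correspond to roughly two word errors per hundred reference words.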
