Soundwave: Less is More for Speech-Text Alignment in LLMs

About

Existing end-to-end speech large language models (LLMs) usually rely on large-scale annotated data for training, while data-efficient training has not been discussed in depth. We focus on two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. We propose Soundwave, which utilizes an efficient training strategy and a novel architecture to address these issues. Results show that Soundwave outperforms the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks, using only one-fiftieth of the training data. Further analysis shows that Soundwave still retains its intelligence during conversation. The project is available at https://github.com/FreedomIntelligence/Soundwave.

Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li • 2025

Related benchmarks

Task	Dataset	Result
Automatic Speech Recognition	LibriSpeech clean (test)	WER 2.1	1207
Automatic Speech Recognition	LibriSpeech (test-other)	WER 5	1206
Speech Emotion Recognition	MELD	Accuracy 63.5	24
Vocal Sound Classification	VocalSound	Accuracy 90.5	21
Sound Foundation	AIR-Bench 1.0 (test)	Score 62.1	13
Chat Benchmark	AIR-Bench	Score (Speech Domain) 6.41	11
Speech Translation	CoVoST2 En-De	BLEU 30.6	10
Speech Foundation	AIR-Bench Speech Foundation	Speech Grounding 5.92e+3	7
Speech Chat	AIR-Bench 1.0 (test)	Overall Score 6.51	7
Music Foundation Tasks	AIR-Bench Music 1.0 (test)	Inst. Classification Acc 37.1	7
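
For readers unfamiliar with the WER metric in the speech recognition rows: word error rate is conventionally the word-level edit distance between the reference transcript and the system output, divided by the number of reference words, and is usually quoted as a percentage. The following is a minimal sketch under stated assumptions: whitespace tokenization, no text normalization, and a hypothetical helper name `word_error_rate`. It is not the evaluation code used for the benchmarks listed here, which this page does not describe.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction (multiply by 100 for a percentage).

    Assumption: words are split on whitespace and compared exactly,
    with no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.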
