Speech-Omni-Lite: Portable Speech Interfaces for Vision-Language Models

About

Large-scale omni-models have demonstrated impressive capabilities across modalities, but their performance relies heavily on massive multimodal data and incurs substantial computational cost. This work introduces Speech-Omni-Lite, a cost-efficient framework that extends pre-trained Vision-Language (VL) backbones with speech understanding and generation capabilities while fully preserving the backbones' vision-language performance. Specifically, the VL backbone is kept fully frozen and equipped with two lightweight, trainable plug-and-play modules: a speech projector and a speech token generator. To mitigate the scarcity of spoken QA corpora, a low-cost data construction strategy generates Question-Text Answer-Text-Speech (QTATS) data from existing ASR speech-text pairs, facilitating effective speech generation training. Experimental results show that, with only thousands of hours of speech training data, Speech-Omni-Lite achieves spoken QA performance comparable to that of omni-models trained on millions of hours of speech data. Furthermore, the learned speech modules transfer well across VL backbones.
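The training setup described above (a frozen VL backbone plus two trainable plug-in modules that can be carried over to another backbone) can be illustrated with a toy sketch. This is not the authors' implementation: the `Module` and `SpeechVLModel` classes, the `build` helper, the backbone names, and all parameter counts are hypothetical placeholders chosen only to show which parts would receive gradient updates.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Module:
    """Toy stand-in for a neural module: a name, a size, and a trainable flag."""
    name: str
    num_params: int  # hypothetical size, for illustration only
    trainable: bool


@dataclass(frozen=True)
class SpeechVLModel:
    """Hypothetical container: one VL backbone plus two speech plug-ins."""
    vl_backbone: Module
    speech_projector: Module
    speech_token_generator: Module

    def modules(self) -> List[Module]:
        return [self.vl_backbone, self.speech_projector, self.speech_token_generator]

    def trainable_modules(self) -> List[Module]:
        # Only these modules would be handed to an optimizer.
        return [m for m in self.modules() if m.trainable]

    def trainable_fraction(self) -> float:
        total = sum(m.num_params for m in self.modules())
        return sum(m.num_params for m in self.trainable_modules()) / total

    def with_backbone(self, backbone: Module) -> "SpeechVLModel":
        # Reuse the same speech modules on a different backbone, frozen again.
        return replace(self, vl_backbone=replace(backbone, trainable=False))


def build(backbone: Module) -> SpeechVLModel:
    """Attach the two speech modules to a backbone and freeze the backbone."""
    return SpeechVLModel(
        vl_backbone=replace(backbone, trainable=False),
        speech_projector=Module("speech_projector", 50, True),
        speech_token_generator=Module("speech_token_generator", 150, True),
    )


model = build(Module("vl_backbone_a", 7000, True))
print([m.name for m in model.trainable_modules()])

# Plug the same speech modules into a second (hypothetical) backbone.
moved = model.with_backbone(Module("vl_backbone_b", 3000, True))
print(moved.vl_backbone.name, moved.vl_backbone.trainable)
```

In a real framework the same idea would typically be expressed by disabling gradients on the backbone's parameters and passing only the two speech modules' parameters to the optimizer.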

Dehua Tao, Xuan Luo, Daxin Tan, Kai Chen, Lanqing Hong, Jing Li, Ruifeng Xu, Xiao Chen • 2026

Related benchmarks

Task	Dataset	Result
Automatic Speech Recognition	LibriSpeech clean (test)	WER 3.58	1207
Automatic Speech Recognition	LibriSpeech (test-other)	WER 7.75	1206
Automatic Speech Recognition	AISHELL-1 (test)	CER 5.46	105
Automatic Speech Recognition	WenetSpeech Meeting (test)	CER 18.18	78
Automatic Speech Recognition	WenetSpeech Net (test)	CER 16.5	57
Speech-to-Speech Question-Answering	Llama Questions	Accuracy 73.33	27
Speech-to-Text Question-Answering	LlamaQ	Accuracy 86	26
Speech-to-Text Question-Answering	TriviaQA	Accuracy 71.8	26
Speech-to-Text Question-Answering	WebQ	Accuracy 66.6	26
Speech-to-Speech Question-Answering	WebQ	Accuracy 46.2	25

Showing 10 of 17 rows
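
For readers unfamiliar with the ASR metrics in the table, word error rate (WER) and character error rate (CER) are conventionally defined as the edit distance between hypothesis and reference (substitutions, deletions, and insertions) divided by the reference length. The sketch below is a minimal standard-library version, assuming scores are reported as percentages and that whitespace is ignored for CER. The text normalization used for the reported results is not stated here, so the function names and normalization choices are illustrative only.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, with whitespace tokenization (an assumption)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, ignoring whitespace (an assumption)."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One deleted word out of six reference words.
print(wer("the cat sat on the mat", "the cat sat on mat"))
# One substituted character out of four reference characters.
print(cer("abcd", "abxd"))
```

Because the denominator is the reference length, both rates can exceed 100 when the hypothesis contains many insertions.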
