
Speech-Omni-Lite: Portable Speech Interfaces for Vision-Language Models

About

While large-scale omni-models have demonstrated impressive capabilities across modalities, their strong performance relies heavily on massive multimodal data and incurs substantial computational cost. This work introduces Speech-Omni-Lite, a cost-efficient framework for extending pre-trained vision-language (VL) backbones with speech understanding and generation capabilities while fully preserving the backbones' vision-language performance. Specifically, the VL backbone is kept fully frozen and equipped with two lightweight, trainable plug-and-play modules: a speech projector and a speech token generator. To mitigate the scarcity of spoken QA corpora, a low-cost data construction strategy is proposed to generate Question-Text Answer-Text-Speech (QTATS) data from existing ASR speech-text pairs, facilitating effective speech generation training. Experimental results show that, even with only thousands of hours of speech training data, Speech-Omni-Lite achieves spoken QA performance comparable to omni-models trained on millions of hours of speech data. Furthermore, the learned speech modules exhibit strong transferability across VL backbones.

Dehua Tao, Xuan Luo, Daxin Tan, Kai Chen, Lanqing Hong, Jing Li, Ruifeng Xu, Xiao Chen • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 3.58 | 1207 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 7.75 | 1206 |
| Automatic Speech Recognition | AISHELL-1 (test) | CER | 5.46 | 105 |
| Automatic Speech Recognition | WenetSpeech Meeting (test) | CER | 18.18 | 78 |
| Automatic Speech Recognition | WenetSpeech Net (test) | CER | 16.5 | 57 |
| Speech-to-Speech Question-Answering | Llama Questions | Accuracy | 73.33 | 27 |
| Speech-to-Text Question-Answering | LlamaQ | Accuracy | 86 | 26 |
| Speech-to-Text Question-Answering | TriviaQA | Accuracy | 71.8 | 26 |
| Speech-to-Text Question-Answering | WebQ | Accuracy | 66.6 | 26 |
| Speech-to-Speech Question-Answering | WebQ | Accuracy | 46.2 | 25 |
Showing 10 of 17 rows
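
For readers unfamiliar with the ASR metrics listed above, the sketch below shows how word error rate (WER) and character error rate (CER) are conventionally computed: the edit distance between reference and hypothesis, divided by the reference length. The function names `edit_distance` and `error_rate` are illustrative, not taken from the paper. The whitespace tokenization and the percentage scaling are assumptions, since the listing does not state which text normalization or scoring tool the benchmark uses.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from reference prefix to empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """Return WER (unit="word") or CER (unit="char") as a percentage.

    Assumption: words are split on whitespace, and spaces are ignored
    for character-level scoring. Real benchmarks usually also normalize
    case and punctuation, which is not done here.
    """
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(error_rate("the cat sat on the mat", "the cat sit on mat"))
    # One substituted character out of four.
    print(error_rate("你好世界", "你好视界", unit="char"))
```

Under these definitions lower is better for WER and CER, whereas the question-answering rows report accuracy, where higher is better.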
