Closing the Gap Between Text and Speech Understanding in LLMs
About
Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts, and even cascaded pipelines, on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD (Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation), which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.
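To make the cross-modal distillation idea concrete, below is a minimal sketch of a distillation loss between a frozen text teacher and a speech-adapted student that read paired text/speech prompts and are aligned on the shared response tokens. The function name, tensor shapes, temperature, and overall training setup are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch of a cross-modal distillation loss (assumed formulation, not the
# paper's exact objective): the speech-adapted student is trained to match
# the next-token distribution of the frozen text teacher on paired inputs.
import torch
import torch.nn.functional as F


def cross_modal_distillation_loss(student_logits: torch.Tensor,
                                  teacher_logits: torch.Tensor,
                                  temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between teacher and student next-token distributions.

    student_logits: (batch, seq_len, vocab) from the speech-adapted LLM
                    reading the spoken prompt.
    teacher_logits: (batch, seq_len, vocab) from the frozen text LLM
                    reading the equivalent text prompt.
    """
    t = temperature
    vocab = student_logits.size(-1)
    # Flatten (batch, seq_len) so "batchmean" averages over all positions.
    student_log_probs = F.log_softmax(student_logits / t, dim=-1).reshape(-1, vocab)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1).reshape(-1, vocab)
    # Standard distillation rescaling by T^2.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t * t)


if __name__ == "__main__":
    # Toy usage with random logits standing in for the two models' outputs.
    batch, seq_len, vocab = 2, 8, 32000
    student = torch.randn(batch, seq_len, vocab, requires_grad=True)
    teacher = torch.randn(batch, seq_len, vocab)
    loss = cross_modal_distillation_loss(student, teacher)
    loss.backward()
    print(f"distillation loss: {loss.item():.4f}")
```

In such a setup, the distillation term would typically be combined with the usual next-token cross-entropy on the speech branch; how the two terms are weighted, and how the targeted synthetic data is selected, is specific to SALAD and not captured by this sketch.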
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 76.9 | 1460 |
| Question Answering | ARC Challenge | Accuracy | 89.2 | 749 |
| Physical Commonsense Reasoning | PIQA | Accuracy | 80.3 | 329 |
| Story Completion | StoryCloze | Accuracy | 81.5 | 65 |
| Commonsense Reasoning | StoryCloze | Accuracy | 84.9 | 34 |
| Reasoning | VoiceBench | MMSU Accuracy (Audio) | 57.5 | 13 |
| Science Question Answering | ARC-C | Accuracy | 84 | 11 |
| Multi-task Knowledge | MMSU | Accuracy | 57.5 | 11 |
| OpenBook Question Answering | OBQA | Accuracy | 0.767 | 11 |
| Multi-task Language Understanding | MMSU | Accuracy | 71.6 | 6 |