Late Fusion and Multi-Level Fission Amplify Cross-Modal Transfer in Text-Speech LMs

About

Text-Speech Language Models (TSLMs) -- language models trained to jointly process and generate text and speech -- are commonly trained through an early modality fusion/fission approach, in which both modalities are fed into and predicted from a shared backbone via linear layers. We hypothesize that this approach limits cross-modal transfer by neglecting feature compositionality -- specifically, the finer-grained nature of speech representations compared to text -- preventing the emergence of a shared feature hierarchy within model layers. In this paper, we argue that this limitation can be addressed through late fusion and fission, with a fission process that accesses both high- and low-level features for speech generation. Our models implementing these principles, SmolTolk, rival or surpass state-of-the-art TSLMs trained with orders of magnitude more compute, and achieve significantly improved cross-modal performance relative to early fusion/fission baselines. Representation analyses further suggest that our method enhances the model's ability to abstract higher-level, more semantic features from speech, and leads to increasingly shared representation spaces across layers.
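The contrast the abstract draws can be illustrated with a toy sketch. This is one possible reading, not the SmolTolk architecture: the class `ToyTSLM`, its layer counts, the residual `tanh` layers standing in for transformer blocks, and the simple averaging of a low-level and a high-level backbone state are all assumptions made for illustration. In the early variant, a speech frame enters the backbone through a single linear projection and is predicted from the last backbone state. In the late variant, speech first passes through extra speech-specific layers, and the speech head reads a mix of an early and a late backbone state.

```python
import math
import random


def linear(in_dim, out_dim, rng):
    """Random weight matrix (out_dim x in_dim) stored as nested lists."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply(W, x):
    """Matrix-vector product W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def layer(W, x):
    """Toy residual layer x + tanh(W x); a stand-in for a transformer block."""
    return [xi + math.tanh(h) for xi, h in zip(x, apply(W, x))]


class ToyTSLM:
    """Hypothetical single-frame model contrasting early and late fusion/fission."""

    def __init__(self, d_speech=4, d_model=8, n_units=5,
                 n_backbone=3, n_speech_layers=2, seed=0):
        rng = random.Random(seed)
        self.speech_in = linear(d_speech, d_model, rng)   # input projection
        self.speech_layers = [linear(d_model, d_model, rng)
                              for _ in range(n_speech_layers)]  # late fusion only
        self.backbone = [linear(d_model, d_model, rng) for _ in range(n_backbone)]
        self.speech_head = linear(d_model, n_units, rng)  # speech-unit logits

    def forward(self, speech_frame, late=True):
        h = apply(self.speech_in, speech_frame)
        if late:
            # Late fusion: speech-specific layers run before the shared backbone.
            for W in self.speech_layers:
                h = layer(W, h)
        states = []
        for W in self.backbone:
            h = layer(W, h)
            states.append(h)
        if late:
            # Multi-level fission (assumed form): average a low-level and a
            # high-level backbone state before the speech head.
            feat = [0.5 * (lo + hi) for lo, hi in zip(states[0], states[-1])]
        else:
            # Early fission: predict directly from the last backbone state.
            feat = states[-1]
        return apply(self.speech_head, feat)


model = ToyTSLM()
frame = [0.1, -0.2, 0.3, 0.05]
early_logits = model.forward(frame, late=False)
late_logits = model.forward(frame, late=True)
```

Both calls return one logit per speech unit. A real model would operate on sequences with attention and would likely learn how to weight the layers rather than average two of them.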

Santiago Cuervo, Adel Moumen, Yanis Labrak, Sameer Khurana, Antoine Laurent, Mickael Rouvier, Phil Woodland, Ricard Marxer • 2025

Related benchmarks

Task	Dataset	Result	Rank
Semantic and linguistic knowledge evaluation	ZeroSpeech	sBLiMP Score: 61.9	20
Discourse-level coherence evaluation	Topic Story-Cloze (tSC)	tSC Score: 87.6	19
