Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training

About

Modeling fine-grained speaking styles remains challenging for language-speech representation pre-training, as existing speech-text models are typically trained with coarse captions or task-specific supervision, and scalable fine-grained style annotations are unavailable. We present FCaps, a large-scale dataset with fine-grained free-text style descriptions, encompassing 47k hours of speech and 19M fine-grained captions annotated via a novel end-to-end pipeline that directly grounds detailed captions in audio, thereby avoiding the error propagation caused by LLM-based rewriting in existing cascaded pipelines. Evaluations using LLM-as-a-judge demonstrate that our annotations surpass existing cascaded annotations in terms of correctness, coverage, and naturalness. Building on FCaps, we propose CLSP, a contrastive language-speech pre-trained model that integrates global and fine-grained supervision, enabling unified representations across multiple granularities. Extensive experiments demonstrate that CLSP learns fine-grained and multi-granular speech-text representations that perform reliably across global and fine-grained speech-text retrieval, zero-shot paralinguistic classification, and speech style similarity scoring, with strong alignment to human judgments. Code and dataset are publicly available at https://github.com/yfyeung/CLSP.

Yifan Yang, Bing Han, Hui Wang, Wei Wang, Ziyang Ma, Long Zhou, Zengrui Jin, Guanrou Yang, Tianrui Wang, Xu Tan, Xie Chen • 2026
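
The abstract describes CLSP as a contrastive language-speech model but does not state the loss. As a rough illustration of what contrastive speech-text alignment usually means, the sketch below implements a generic CLIP-style symmetric InfoNCE loss over a batch of paired embeddings. The function name, the temperature value, and the choice of this particular loss are assumptions for illustration only; this is not the authors' objective, and it does not cover how CLSP combines global and fine-grained supervision.

```python
import math


def symmetric_contrastive_loss(speech_embs, text_embs, temperature=0.07):
    """Generic CLIP-style symmetric InfoNCE (an assumed stand-in, not CLSP's loss).

    speech_embs[i] and text_embs[i] are treated as a matching pair; every
    other combination in the batch acts as a negative.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    speech = [normalize(v) for v in speech_embs]
    text = [normalize(v) for v in text_embs]
    n = len(speech)

    # logits[i][j]: temperature-scaled cosine similarity of speech i and text j
    logits = [
        [sum(a * b for a, b in zip(speech[i], text[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        # Mean of -log softmax(row)[i], computed with a stable log-sum-exp.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    speech_to_text = cross_entropy_on_diagonal(logits)
    text_to_speech = cross_entropy_on_diagonal([list(c) for c in zip(*logits)])
    return 0.5 * (speech_to_text + text_to_speech)


# Toy batch: aligned pairs give a near-zero loss, swapped pairs a large one.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is small when each speech embedding points the same way as its paired caption embedding, and large when the pairs are swapped, which is the behaviour a contrastive objective is meant to reward.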

Related benchmarks

Task	Dataset	Metric	Result	
Emotion Recognition	IEMOCAP	--	--	151
Speech Emotion Recognition	RAVDESS	Unweighted Accuracy	46	43
Emotion Recognition	CREMA-D	WA (Weighted Average)	35.1	12
Age Classification	CREMA-D	WA	40.6	5
Gender Classification	RAVDESS	Weighted Accuracy	100	5
Speech Style Similarity Scoring	ParaSpeech-Caps (holdout)	Pearson Corr (r) [Intrinsic]	0.893	4
Speech-to-Text Retrieval	ParaSpeechCaps Global captions (test)	R@1	45.6	4
Speech-to-Text Retrieval	ParaSpeechCaps Fine-Grained captions (test)	R@1	68.1	4
Text-to-Speech Retrieval	ParaSpeechCaps Global captions (test)	R@1	40.3	4
Text-to-Speech Retrieval	ParaSpeechCaps Fine-Grained captions (test)	R@1	67.2	4
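
The retrieval rows above report R@1 (recall at rank 1). As a minimal sketch of how such a number is commonly computed, assuming a square similarity matrix in which query i matches candidate i, the helper below returns the fraction of queries whose true match appears in the top k. The function name and the toy matrix are hypothetical, and the listed results are presumably percentages, whereas this helper returns a fraction.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose matching candidate (same index) is in the top k.

    similarity[i][j] is the score between query i and candidate j.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy speech-to-text similarity matrix (rows: speech clips, columns: captions).
toy_similarity = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]

speech_to_text_r1 = recall_at_k(toy_similarity, k=1)
speech_to_text_r2 = recall_at_k(toy_similarity, k=2)

# Text-to-speech retrieval uses the transposed matrix (captions as queries).
text_to_speech_r1 = recall_at_k([list(col) for col in zip(*toy_similarity)], k=1)
```

Here the second clip ranks the wrong caption first, so speech-to-text R@1 is 2/3, while its correct caption is ranked second, so R@2 is 1.0.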

