Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training
About
Modeling fine-grained speaking styles remains challenging for language-speech representation pre-training, as existing speech-text models are typically trained with coarse captions or task-specific supervision, and scalable fine-grained style annotations are unavailable. We present FCaps, a large-scale dataset with fine-grained free-text style descriptions, encompassing 47k hours of speech and 19M fine-grained captions annotated via a novel end-to-end pipeline that directly grounds detailed captions in audio, thereby avoiding the error propagation caused by LLM-based rewriting in existing cascaded pipelines. Evaluations using LLM-as-a-judge demonstrate that our annotations surpass existing cascaded annotations in terms of correctness, coverage, and naturalness. Building on FCaps, we propose CLSP, a contrastive language-speech pre-trained model that integrates global and fine-grained supervision, enabling unified representations across multiple granularities. Extensive experiments demonstrate that CLSP learns fine-grained and multi-granular speech-text representations that perform reliably across global and fine-grained speech-text retrieval, zero-shot paralinguistic classification, and speech style similarity scoring, with strong alignment to human judgments. Code and dataset are publicly available at https://github.com/yfyeung/CLSP.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Emotion Recognition | IEMOCAP | -- | -- | 71 |
| Speech Emotion Recognition | RAVDESS | Weighted Accuracy | 46.8 | 19 |
| Emotion Recognition | CREMA-D | WA (Weighted Average) | 35.1 | 12 |
| Age Classification | CREMA-D | WA | 40.6 | 5 |
| Gender Classification | RAVDESS | Weighted Accuracy | 100 | 5 |
| Speech Style Similarity Scoring | ParaSpeech-Caps (holdout) | Pearson Corr (r) [Intrinsic] | 0.893 | 4 |
| Speech-to-Text Retrieval | ParaSpeechCaps Global captions (test) | R@1 | 45.6 | 4 |
| Speech-to-Text Retrieval | ParaSpeechCaps Fine-Grained captions (test) | R@1 | 68.1 | 4 |
| Text-to-Speech Retrieval | ParaSpeechCaps Global captions (test) | R@1 | 40.3 | 4 |
| Text-to-Speech Retrieval | ParaSpeechCaps Fine-Grained captions (test) | R@1 | 67.2 | 4 |
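
The retrieval rows above report R@1 (recall at rank 1). As a rough illustration of how such a score is usually computed from a speech-text similarity matrix, here is a minimal sketch in plain Python. It is not taken from the CLSP codebase: the function name `recall_at_k`, the toy matrix, and the assumption that each query has exactly one correct match at the same index are all illustrative, and the actual benchmark protocol may differ (for example, several captions per clip).

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Percentage of queries whose matching item ranks in the top k.

    Assumption (illustrative only): row i is a query and column i is its
    single ground-truth match.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for query_idx, row in enumerate(similarity):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if query_idx in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 matrix: rows are speech queries, columns are candidate captions.
toy = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]

print(recall_at_k(toy, k=1))  # 2 of 3 queries retrieve their match first
print(recall_at_k(toy, k=2))  # all 3 matches appear within the top 2
```

Under the same assumptions, the text-to-speech direction would use the transposed matrix, so that rows are captions and columns are speech clips.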