EmoTransCap: Dataset and Pipeline for Emotion Transition-Aware Speech Captioning in Discourses
About
Emotion perception and adaptive expression are fundamental capabilities in human-agent interaction. While recent advances in speech emotion captioning (SEC) have improved fine-grained emotional modeling, existing systems remain limited to static, single-emotion characterization within isolated sentences, neglecting dynamic emotional transitions at the discourse level. To address this gap, we propose Emotion Transition-Aware Speech Captioning (EmoTransCap), a paradigm that integrates temporal emotion dynamics with discourse-level speech description. To construct a dataset rich in emotion transitions while enabling scalable expansion, we design an automated pipeline for dataset creation. This is the first large-scale dataset explicitly designed to capture discourse-level emotion transitions. To generate semantically rich descriptions, we incorporate acoustic attributes and temporal cues from discourse-level speech. Our Multi-Task Emotion Transition Recognition (MTETR) model performs joint emotion transition detection and diarization. Leveraging the semantic analysis capabilities of LLMs, we produce two annotation versions: descriptive and instruction-oriented. These data and annotations offer a valuable resource for advancing emotion perception and emotional expressiveness. The dataset enables speech captions that capture emotional transitions, facilitating temporal-dynamic and fine-grained emotion understanding. We also introduce a controllable, transition-aware emotional speech synthesis system at the discourse level, enhancing anthropomorphic emotional expressiveness and supporting emotionally intelligent conversational agents.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Emotion Expression Evaluation | Chinese (Zh) Emotion Expression (test) | ETT1 Score | 68.62 | 6 |
| Emotion Perception | EmoTransSpeech Zh - One Transition V1 (test) | ETC Accuracy | 100 | 5 |
| Emotion Perception | EmoTransSpeech Zh - Two Transitions V1 (test) | ETC Accuracy | 100 | 5 |
| Emotion Perception | EmoTransSpeech Zh - Three Transitions V1 (test) | ETC Accuracy | 100 | 5 |
| Emotion Expression Evaluation | English (En) Emotion Expression (test) | ETT1 Score | 73.24 | 4 |
| Emotion Perception | EmoTransSpeech En - One Transition V1 (test) | ETC Accuracy | 100 | 3 |
| Emotion Perception | EmoTransSpeech En - Two Transitions V1 (test) | ETC Accuracy | 1 | 3 |
| Emotion Perception | EmoTransSpeech En - Three Transitions V1 (test) | ETC Accuracy | 100 | 3 |