EmoTransCap: Dataset and Pipeline for Emotion Transition-Aware Speech Captioning in Discourses

About

Emotion perception and adaptive expression are fundamental capabilities in human-agent interaction. While recent advances in speech emotion captioning (SEC) have improved fine-grained emotional modeling, existing systems remain limited to static, single-emotion characterization within isolated sentences, neglecting dynamic emotional transitions at the discourse level. To address this gap, we propose Emotion Transition-Aware Speech Captioning (EmoTransCap), a paradigm that integrates temporal emotion dynamics with discourse-level speech description. To construct a dataset rich in emotion transitions while enabling scalable expansion, we design an automated pipeline for dataset creation. The resulting dataset is the first large-scale dataset explicitly designed to capture discourse-level emotion transitions. To generate semantically rich descriptions, we incorporate acoustic attributes and temporal cues from discourse-level speech. Our Multi-Task Emotion Transition Recognition (MTETR) model performs joint emotion transition detection and diarization. Leveraging the semantic analysis capabilities of LLMs, we produce two annotation versions: descriptive and instruction-oriented. These data and annotations offer a valuable resource for advancing emotion perception and emotional expressiveness. The dataset enables speech captions that capture emotional transitions, facilitating temporal-dynamic and fine-grained emotion understanding. We also introduce a controllable, transition-aware emotional speech synthesis system at the discourse level, enhancing anthropomorphic emotional expressiveness and supporting emotionally intelligent conversational agents.

Shuhao Xu, Yifan Hu, Jingjing Wu, Zhihao Du, Zheng Lian, Rui Liu • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Emotion Expression Evaluation | Chinese (Zh) Emotion Expression (test) | ETT1 Score 68.62 | 6 |
| Emotion Perception | EmoTransSpeech Zh - One Transition V1 (test) | ETC Accuracy 100 | 5 |
| Emotion Perception | EmoTransSpeech Zh - Two Transitions V1 (test) | Accuracy ETC 100 | 5 |
| Emotion Perception | EmoTransSpeech Zh - Three Transitions V1 (test) | Accuracy ETC 100 | 5 |
| Emotion Expression Evaluation | English (En) Emotion Expression (test) | ETT1 Score 73.24 | 4 |
| Emotion Perception | EmoTransSpeech En - One Transition V1 (test) | Accuracy ETC 100 | 3 |
| Emotion Perception | EmoTransSpeech En - Two Transitions V1 (test) | Accuracy ETC 1 | 3 |
| Emotion Perception | EmoTransSpeech En Three Transitions V1 (test) | AccETC 100 | 3 |