
TellWhisper: Tell Whisper Who Speaks When

About

Multi-speaker automatic speech recognition (MASR) aims to predict "who spoke when and what" from multi-speaker speech, a key technology for multi-party dialogue understanding. However, most existing approaches decouple temporal modeling and speaker modeling when addressing "when" and "who": some inject speaker cues before encoding (e.g., speaker masking), which can cause irreversible information loss; others fuse identity by mixing speaker posteriors after encoding, which may entangle acoustic content with speaker identity. This separation is brittle under rapid turn-taking and overlapping speech, often leading to degraded performance. To address these limitations, we propose TellWhisper, a unified framework that jointly models speaker identity and temporal information within the speech encoder. Specifically, we design TS-RoPE, a time-speaker rotary positional encoding: time coordinates are derived from frame indices, while speaker coordinates are derived from speaker activity and pause cues. By applying region-specific rotation angles, the model explicitly captures per-speaker continuity, speaker-turn transitions, and state dynamics, enabling the attention mechanism to attend to "when" and "who" simultaneously. Moreover, to estimate frame-level speaker activity, we develop Hyper-SD, which casts speaker classification in hyperbolic space to enhance inter-class separation and refine speaker-activity estimates. Extensive experiments demonstrate the effectiveness of the proposed approach.
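The abstract describes TS-RoPE only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes the feature vector is split into two equal regions, one rotated by a time coordinate and one by a speaker coordinate, with standard rotary frequencies in each. The function names (`rope_rotate`, `ts_rope`), the equal split, the frequency base, and the use of a plain integer as the speaker coordinate are all assumptions for illustration; the paper derives speaker coordinates from speaker activity and pause cues in a way not specified here.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Standard rotary encoding: rotate each consecutive pair of
    features by an angle proportional to the position `pos`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def ts_rope(vec, time_pos, speaker_pos):
    """Hypothetical two-region rotation (assumed layout): the first
    half of the features encodes time, the second half encodes speaker.
    Requires len(vec) to be divisible by 4."""
    half = len(vec) // 2
    return rope_rotate(vec[:half], time_pos) + rope_rotate(vec[half:], speaker_pos)

def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (illustrative values only).
q = [0.3, -1.2, 0.7, 0.5, 1.1, -0.4, 0.2, 0.9]
k = [-0.6, 0.8, 0.1, -0.3, 0.5, 1.4, -0.7, 0.2]

# Same time offset (2 frames) and same speaker offset (0) at two
# different absolute positions.
score_a = dot(ts_rope(q, 5, 1), ts_rope(k, 3, 1))
score_b = dot(ts_rope(q, 12, 2), ts_rope(k, 10, 2))

# Same time offset, but the key now belongs to a different speaker.
score_c = dot(ts_rope(q, 5, 1), ts_rope(k, 3, 2))
```

Under these assumptions, `score_a` and `score_b` agree up to floating-point error because rotary encodings make the score depend only on coordinate differences, while `score_c` generally differs because the speaker offset has changed. This is one way an attention score could be made sensitive to both "when" and "who" at once.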

Yifan Hu, Peiji Yang, Zhisheng Wang, Yicheng Zhong, Rui Liu • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Speaker Diarization | AMI | DER | 8.82 | 24 |
| Speaker Diarization | AISHELL-4 | DER (%) | 4.44 | 20 |
| Multi-speaker Automatic Speech Recognition | AMI | CP-WER | 32.53 | 11 |
| Speaker Diarization | RAMC | DER | 6.48 | 9 |
| Multi-speaker Automatic Speech Recognition | Libri2Mix | CP-WER | 14.39 | 8 |
| Multi-speaker Automatic Speech Recognition | NotSoFar | CP-WER | 34.48 | 7 |
| Multi-speaker Automatic Speech Recognition | LibriCSS | CP-WER | 9.88 | 7 |
| Speaker Diarization | AliMeeting | DER | 4.59 | 6 |
| Speaker Diarization | MSDWild | DER | 4.79 | 6 |
| Speaker Diarization | VoxConverse | DER | 5.21 | 6 |
