
TellWhisper: Tell Whisper Who Speaks When

About

Multi-speaker automatic speech recognition (MASR) aims to predict "who spoke when and what" from multi-speaker speech, a key technology for multi-party dialogue understanding. However, most existing approaches decouple temporal modeling from speaker modeling when addressing "when" and "who": some inject speaker cues before encoding (e.g., speaker masking), which can cause irreversible information loss; others fuse identity by mixing speaker posteriors after encoding, which may entangle acoustic content with speaker identity. This separation is brittle under rapid turn-taking and overlapping speech, and often degrades performance. To address these limitations, we propose TellWhisper, a unified framework that jointly models speaker identity and temporal information within the speech encoder. Specifically, we design TS-RoPE, a time-speaker rotary positional encoding: time coordinates are derived from frame indices, while speaker coordinates are derived from speaker activity and pause cues. By applying region-specific rotation angles, the model explicitly captures per-speaker continuity, speaker-turn transitions, and state dynamics, enabling the attention mechanism to attend to "when" and "who" simultaneously. Moreover, to estimate frame-level speaker activity, we develop Hyper-SD, which casts speaker classification in hyperbolic space to enhance inter-class separation and refine speaker-activity estimates. Extensive experiments demonstrate the effectiveness of the proposed approach.
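
The abstract gives no formulas for TS-RoPE, so the following is only a rough sketch of what a two-axis rotary encoding could look like, not the authors' implementation. It assumes that "region-specific rotation angles" means splitting the feature dimension into two halves, one rotated by the time coordinate and one by the speaker coordinate. The function names `rotate_pairs` and `ts_rope`, the frequency base of 10000, and the speaker coordinate scheme (0 for a pause, a positive integer per active speaker) are all hypothetical.

```python
import numpy as np

def rotate_pairs(x, coords, base=10000.0):
    """Standard rotary rotation: rotate feature pairs of x (T, d) by coords[t] * freq[i]."""
    x = np.asarray(x, dtype=float)
    coords = np.asarray(coords, dtype=float)
    d = x.shape[1]
    assert d % 2 == 0, "feature dimension must be even"
    freqs = base ** (-np.arange(d // 2) / (d // 2))   # one frequency per pair
    angles = coords[:, None] * freqs[None, :]          # shape (T, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

def ts_rope(x, time_coords, speaker_coords):
    """Hypothetical two-region rotation (an assumption, not the paper's definition):
    the first half of the features is rotated by the time coordinate,
    the second half by the speaker coordinate."""
    x = np.asarray(x, dtype=float)
    half = x.shape[1] // 2
    return np.concatenate(
        [rotate_pairs(x[:, :half], time_coords),
         rotate_pairs(x[:, half:], speaker_coords)],
        axis=1,
    )

# Toy example: 6 frames, 8-dimensional queries.
rng = np.random.default_rng(0)
q = rng.standard_normal((6, 8))
time_coords = np.arange(6)                      # frame indices
speaker_coords = np.array([1, 1, 0, 2, 2, 2])   # assumed labels: 0 = pause, 1 and 2 = speakers
q_rot = ts_rope(q, time_coords, speaker_coords)
```

Because each region is a pure rotation, vector norms are preserved, and the dot product between a rotated query and a rotated key depends only on the differences of their time and speaker coordinates. Under these assumptions, that relative dependence is the property that would let attention scores reflect both "when" and "who".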

Yifan Hu, Peiji Yang, Zhisheng Wang, Yicheng Zhong, Rui Liu • 2026

Related benchmarks

Task                                        Dataset       Metric    Result   Rank
------------------------------------------  ------------  --------  -------  ----
Speaker Diarization                         AISHELL-4     DER (%)   4.44     20
Speaker Diarization                         AMI           DER       8.82     15
Speaker Diarization                         RAMC          DER       6.48     9
Multi-speaker Automatic Speech Recognition  Libri2Mix     CP-WER    14.39    8
Multi-speaker Automatic Speech Recognition  AMI           CP-WER    32.53    7
Multi-speaker Automatic Speech Recognition  NotSoFar      CP-WER    34.48    7
Multi-speaker Automatic Speech Recognition  LibriCSS      CP-WER    9.88     7
Speaker Diarization                         AliMeeting    DER       4.59     6
Speaker Diarization                         MSDWild       DER       4.79     6
Speaker Diarization                         VoxConverse   DER       5.21     6
