TellWhisper: Tell Whisper Who Speaks When
About
Multi-speaker automatic speech recognition (MASR) aims to predict "who spoke when and what" from multi-speaker speech, a key technology for multi-party dialogue understanding. However, most existing approaches decouple temporal modeling and speaker modeling when addressing "when" and "who": some inject speaker cues before encoding (e.g., speaker masking), which can cause irreversible information loss; others fuse identity by mixing speaker posteriors after encoding, which may entangle acoustic content with speaker identity. This separation is brittle under rapid turn-taking and overlapping speech, often leading to degraded performance. To address these limitations, we propose TellWhisper, a unified framework that jointly models speaker identity and temporal information within the speech encoder. Specifically, we design TS-RoPE, a time-speaker rotary positional encoding: time coordinates are derived from frame indices, while speaker coordinates are derived from speaker activity and pause cues. By applying region-specific rotation angles, the model explicitly captures per-speaker continuity, speaker-turn transitions, and state dynamics, enabling the attention mechanism to simultaneously attend to "when" and "who". Moreover, to estimate frame-level speaker activity, we develop Hyper-SD, which casts speaker classification in hyperbolic space to enhance inter-class separation and refine speaker-activity estimates. Extensive experiments demonstrate the effectiveness of the proposed approach.
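To make the idea of a two-axis rotary encoding concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes, purely for illustration, that the feature dimension is split into two halves, with one half rotated by the frame index and the other by a speaker coordinate. The function names (`rope_rotate`, `speaker_coords`, `ts_rope`), the half-and-half split, and the mapping from activity to coordinate (0 for a pause, `s + 1` for the first active speaker `s`) are all hypothetical choices; the abstract does not specify them.

```python
import numpy as np


def rope_rotate(x, pos, base=10000.0):
    """Standard rotary encoding: rotate feature pairs of x (T, d) by angles pos * inv_freq."""
    x = np.asarray(x, dtype=np.float64)
    d = x.shape[-1]
    assert d % 2 == 0, "feature dimension must be even"
    inv_freq = base ** (-np.arange(0, d, 2) / d)        # (d/2,)
    angles = np.asarray(pos, dtype=np.float64)[:, None] * inv_freq[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


def speaker_coords(activity):
    """Hypothetical mapping from frame-level activity (T, S) to one coordinate per frame.

    Assumption: 0 marks a pause; s + 1 marks the first active speaker s.
    Overlapping speech is collapsed to the lowest-index speaker for simplicity.
    """
    activity = np.asarray(activity).astype(bool)
    any_active = activity.any(axis=1)
    return np.where(any_active, activity.argmax(axis=1) + 1, 0).astype(np.float64)


def ts_rope(x, activity):
    """Rotate the first half of the features by time and the second half by speaker coordinate."""
    x = np.asarray(x, dtype=np.float64)
    num_frames, d = x.shape
    assert d % 4 == 0, "each half needs an even number of features"
    half = d // 2
    time_pos = np.arange(num_frames, dtype=np.float64)  # time coordinate = frame index
    spk_pos = speaker_coords(activity)                  # speaker coordinate (assumed form)
    return np.concatenate(
        [rope_rotate(x[:, :half], time_pos), rope_rotate(x[:, half:], spk_pos)],
        axis=1,
    )
```

Under these assumptions, the dot product between two rotated frames depends only on their time offset and their speaker-coordinate offset, so an attention score computed from such queries and keys can, in principle, distinguish "same speaker, one frame apart" from "different speaker, one frame apart".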
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Speaker Diarization | AISHELL-4 | DER (%) | 4.44 | 20 |
| Speaker Diarization | AMI | DER | 8.82 | 15 |
| Speaker Diarization | RAMC | DER | 6.48 | 9 |
| Multi-speaker Automatic Speech Recognition | Libri2Mix | CP-WER | 14.39 | 8 |
| Multi-speaker Automatic Speech Recognition | AMI | CP-WER | 32.53 | 7 |
| Multi-speaker Automatic Speech Recognition | NotSoFar | CP-WER | 34.48 | 7 |
| Multi-speaker Automatic Speech Recognition | LibriCSS | CP-WER | 9.88 | 7 |
| Speaker Diarization | AliMeeting | DER | 4.59 | 6 |
| Speaker Diarization | MSDWild | DER | 4.79 | 6 |
| Speaker Diarization | VoxConverse | DER | 5.21 | 6 |