TellWhisper: Tell Whisper Who Speaks When

About

Multi-speaker automatic speech recognition (MASR) aims to predict "who spoke when and what" from multi-speaker speech, a key technology for multi-party dialogue understanding. However, most existing approaches decouple temporal modeling from speaker modeling when addressing "when" and "who": some inject speaker cues before encoding (e.g., speaker masking), which can cause irreversible information loss; others fuse identity by mixing speaker posteriors after encoding, which may entangle acoustic content with speaker identity. This separation is brittle under rapid turn-taking and overlapping speech, often leading to degraded performance. To address these limitations, we propose TellWhisper, a unified framework that jointly models speaker identity and temporal information within the speech encoder. Specifically, we design TS-RoPE, a time-speaker rotary positional encoding: time coordinates are derived from frame indices, while speaker coordinates are derived from speaker activity and pause cues. By applying region-specific rotation angles, the model explicitly captures per-speaker continuity, speaker-turn transitions, and state dynamics, enabling the attention mechanism to attend to "when" and "who" simultaneously. Moreover, to estimate frame-level speaker activity, we develop Hyper-SD, which casts speaker classification in hyperbolic space to enhance inter-class separation and refine speaker-activity estimates. Extensive experiments demonstrate the effectiveness of the proposed approach.
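
To make the idea of a time-speaker rotary encoding more concrete, here is a minimal NumPy sketch of a generic two-axis rotary embedding. It is not the authors' TS-RoPE: the abstract does not give the formulation, so the split of the feature dimension into a time half and a speaker half, the function names (`rotary_angles`, `rotate_pairs`, `two_axis_rope`), the frequency base, and the speaker-coordinate scheme (0 for a pause, k for speaker k) are all assumptions chosen for illustration.

```python
import numpy as np

def rotary_angles(coords, n_pairs, base=10000.0):
    """Standard RoPE angles: coords[t] * base**(-i / n_pairs) for pair i."""
    inv_freq = base ** (-np.arange(n_pairs) / n_pairs)
    return np.outer(coords, inv_freq)  # shape (T, n_pairs)

def rotate_pairs(x, angles):
    """Rotate each consecutive feature pair (x[2i], x[2i+1]) by angles[:, i]."""
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

def two_axis_rope(x, time_coords, speaker_coords):
    """Hypothetical two-axis RoPE (an assumption, not the paper's method):
    the first half of the features is rotated by the time coordinate,
    the second half by the speaker coordinate."""
    n_frames, dim = x.shape
    assert dim % 4 == 0, "each half needs an even number of features"
    half = dim // 2
    out = np.empty_like(x)
    out[:, :half] = rotate_pairs(x[:, :half], rotary_angles(time_coords, half // 2))
    out[:, half:] = rotate_pairs(x[:, half:], rotary_angles(speaker_coords, half // 2))
    return out

# Toy input: 6 frames with 8 features each.
rng = np.random.default_rng(0)
frames = rng.standard_normal((6, 8))
time_coords = np.arange(6, dtype=float)                     # frame indices
speaker_coords = np.array([1.0, 1.0, 0.0, 2.0, 2.0, 2.0])   # assumed: 0 = pause, k = speaker k
encoded = two_axis_rope(frames, time_coords, speaker_coords)
scores = encoded @ encoded.T  # attention-style dot products
```

As with ordinary RoPE, each rotation preserves vector norms, and the dot product between two encoded frames depends only on the differences of their time and speaker coordinates. That property is what would let attention scores reflect both "when" and "who" in a scheme of this kind.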

Yifan Hu, Peiji Yang, Zhisheng Wang, Yicheng Zhong, Rui Liu • 2026

Related benchmarks

Task	Dataset	Result
Speaker Diarization	AMI	DER 8.82	24
Speaker Diarization	AISHELL-4	DER (%) 4.44	20
Multi-speaker Automatic Speech Recognition	AMI	CP-WER 32.53	11
Speaker Diarization	RAMC	DER 6.48	9
Multi-speaker Automatic Speech Recognition	Libri2Mix	CP-WER 14.39	8
Multi-speaker Automatic Speech Recognition	NotSoFar	CP-WER 34.48	7
Multi-speaker Automatic Speech Recognition	LibriCSS	CP-WER 9.88	7
Speaker Diarization	AliMeeting	DER 4.59	6
Speaker Diarization	MSDWild	DER 4.79	6
Speaker Diarization	VoxConverse	DER 5.21	6

