
Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR

About

Transcribing and understanding multi-speaker conversations requires speech recognition, speaker attribution, and timestamp localization. While speech LLMs excel at single-speaker tasks, multi-speaker scenarios remain challenging due to overlapping speech, backchannels, rapid turn-taking, and context window constraints. We propose Speaker-Reasoner, an end-to-end Speech LLM with agentic multi-turn temporal reasoning. Instead of single-pass inference, the model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis, jointly modeling speaker identity, gender, timestamps, and transcription. A speaker-aware cache further extends processing to audio exceeding the training context window. Trained with a three-stage progressive strategy, Speaker-Reasoner achieves consistent improvements over strong baselines on AliMeeting and AISHELL-4 datasets, particularly in handling overlapping speech and complex turn-taking.

Zhennan Lin, Shuai Wang, Zhaokai Sun, Pengyuan Xie, Chuan Xie, Jie Liu, Qiang Zhang, Lei Xie • 2026

Related benchmarks

Task                                             Dataset                     Result          Rank
Speaker-attributed Automatic Speech Recognition  AISHELL-4 (test)            CER 13.83       18
Speaker-attributed Automatic Speech Recognition  Alimeeting Far (test)       CER 20.34       14
Speaker Attribute Prediction                     AISHELL4 Eval (test)        Accuracy 96.8   3
Speaker-attributed Automatic Speech Recognition  AISHELL4 Long-form (test)   DER 21.6        2
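For readers unfamiliar with the CER metric reported above, the following is a minimal sketch of the plain character error rate: the character-level edit distance between reference and hypothesis, divided by the reference length and expressed as a percentage. The function names are illustrative and not taken from the paper, and speaker-attributed evaluations may use a speaker-aware variant of this metric, so this sketch should not be read as the exact scoring protocol behind the numbers in the table.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

Because the denominator is the reference length, a hypothesis with many insertions can score above 100.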
