
Sortformer: A Novel Approach for Permutation-Resolved Speaker Supervision in Speech-to-Text Systems

About

Sortformer is an encoder-based speaker diarization model designed for supervising speaker tagging in speech-to-text models. Instead of relying solely on permutation invariant loss (PIL), Sortformer introduces Sort Loss to resolve the permutation problem, either independently or in tandem with PIL. In addition, we propose a streamlined multi-speaker speech-to-text architecture that leverages Sortformer for speaker supervision, embedding speaker labels into the encoder using sinusoidal kernel functions. This design addresses the speaker permutation problem through sorted objectives, effectively bridging timestamps and tokens to supervise speaker labels in the output transcriptions. Experiments demonstrate that Sort Loss can boost speaker diarization performance, and incorporating the speaker supervision from Sortformer improves multi-speaker transcription accuracy. We anticipate that the proposed Sortformer and multi-speaker architecture will enable the seamless integration of speaker tagging capabilities into foundational speech-to-text systems and multimodal large language models (LLMs), offering an easily adoptable and user-friendly mechanism to enhance their versatility and performance in speaker-aware tasks. The code and trained models are made publicly available through the NVIDIA NeMo Framework.
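To make the sorted-objective idea concrete, here is a minimal sketch of a Sort Loss computation. The core notion, per the description above, is that reference speaker-activity labels are ordered by each speaker's arrival time (first speech onset), so the model can be trained against a single canonical permutation instead of searching over all permutations as PIL does. The function names and the NumPy formulation are illustrative assumptions, not the NeMo implementation:

```python
import numpy as np

def sort_speakers_by_arrival(targets):
    """Reorder speaker columns by arrival time (illustrative sketch).

    targets: (T, S) binary matrix of speaker activity over T frames.
    Speakers are sorted by the index of their first active frame, so the
    earliest-speaking speaker becomes column 0, and so on.
    """
    num_frames, num_speakers = targets.shape
    onsets = []
    for s in range(num_speakers):
        active = np.flatnonzero(targets[:, s])
        # Silent speakers sort last (onset past the final frame).
        onsets.append(active[0] if active.size else num_frames)
    order = np.argsort(onsets, kind="stable")
    return targets[:, order]

def sort_loss(preds, targets, eps=1e-7):
    """Binary cross-entropy against arrival-time-sorted targets.

    Because targets are in a canonical (sorted) order, no permutation
    search is needed, unlike permutation invariant loss (PIL).
    """
    sorted_targets = sort_speakers_by_arrival(targets)
    p = np.clip(preds, eps, 1.0 - eps)
    return float(-np.mean(sorted_targets * np.log(p)
                          + (1 - sorted_targets) * np.log(1 - p)))
```

For example, if speaker 1 starts talking before speaker 0, sorting swaps their columns so the model's first output channel always corresponds to the first-arriving speaker; the loss then reduces to plain BCE on that fixed assignment.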

Taejin Park, Ivan Medennikov, Kunal Dhawan, Weiqing Wang, He Huang, Nithin Rao Koluguri, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg • 2024

Related benchmarks

Task                                         Dataset    Result (CP-WER)   Rank
Multi-speaker Automatic Speech Recognition   Libri2Mix  14.62             8
Multi-speaker Automatic Speech Recognition   AMI        34.24             7
Multi-speaker Automatic Speech Recognition   NotSoFar   36.54             7
Multi-speaker Automatic Speech Recognition   LibriCSS   12.16             7
