
Sortformer: A Novel Approach for Permutation-Resolved Speaker Supervision in Speech-to-Text Systems

About

Sortformer is an encoder-based speaker diarization model designed for supervising speaker tagging in speech-to-text models. Instead of relying solely on permutation invariant loss (PIL), Sortformer introduces Sort Loss to resolve the permutation problem, either independently or in tandem with PIL. In addition, we propose a streamlined multi-speaker speech-to-text architecture that leverages Sortformer for speaker supervision, embedding speaker labels into the encoder using sinusoidal kernel functions. This design addresses the speaker permutation problem through sorted objectives, effectively bridging timestamps and tokens to supervise speaker labels in the output transcriptions. Experiments demonstrate that Sort Loss can boost speaker diarization performance, and incorporating the speaker supervision from Sortformer improves multi-speaker transcription accuracy. We anticipate that the proposed Sortformer and multi-speaker architecture will enable the seamless integration of speaker tagging capabilities into foundational speech-to-text systems and multimodal large language models (LLMs), offering an easily adoptable and user-friendly mechanism to enhance their versatility and performance in speaker-aware tasks. The code and trained models are made publicly available through the NVIDIA NeMo Framework.
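To make the sorted-objective idea concrete, here is a minimal sketch of a Sort Loss computation. The core notion, per the description above, is that reference speaker-activity labels are ordered by each speaker's arrival time (first speech onset), so the model can be trained against a single canonical permutation instead of searching over all permutations as PIL does. The function names and the NumPy formulation are illustrative assumptions, not the NeMo implementation:

```python
import numpy as np

def sort_speakers_by_arrival(targets):
    """Reorder speaker columns by arrival time (illustrative sketch).

    targets: (T, S) binary matrix of speaker activity over T frames.
    Speakers are sorted by the index of their first active frame, so the
    earliest-speaking speaker becomes column 0, and so on.
    """
    num_frames, num_speakers = targets.shape
    onsets = []
    for s in range(num_speakers):
        active = np.flatnonzero(targets[:, s])
        # Silent speakers sort last (onset past the final frame).
        onsets.append(active[0] if active.size else num_frames)
    order = np.argsort(onsets, kind="stable")
    return targets[:, order]

def sort_loss(preds, targets, eps=1e-7):
    """Binary cross-entropy against arrival-time-sorted targets.

    Because targets are in a canonical (sorted) order, no permutation
    search is needed, unlike permutation invariant loss (PIL).
    """
    sorted_targets = sort_speakers_by_arrival(targets)
    p = np.clip(preds, eps, 1.0 - eps)
    return float(-np.mean(sorted_targets * np.log(p)
                          + (1 - sorted_targets) * np.log(1 - p)))
```

For example, if speaker 1 starts talking before speaker 0, sorting swaps their columns so the model's first output channel always corresponds to the first-arriving speaker; the loss then reduces to plain BCE on that fixed assignment.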

Taejin Park, Ivan Medennikov, Kunal Dhawan, Weiqing Wang, He Huang, Nithin Rao Koluguri, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg • 2024

Related benchmarks

Task                                         Dataset    Result (CP-WER)   Rank
Multi-speaker Automatic Speech Recognition   Libri2Mix  14.62             8
Multi-speaker Automatic Speech Recognition   AMI        34.24             7
Multi-speaker Automatic Speech Recognition   NotSoFar   36.54             7
Multi-speaker Automatic Speech Recognition   LibriCSS   12.16             7
