Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models

About

As the paradigm of AI shifts from text-based LLMs to Speech Language Models (SLMs), there is a growing demand for full-duplex systems capable of real-time, natural human-computer interaction. However, the development of such models is constrained by the scarcity of high-quality, multi-speaker conversational data, as existing large-scale resources are predominantly single-speaker or limited in volume. Addressing the complex dynamics of natural dialogue, such as overlapping and back-channeling, remains a challenge, with standard processing pipelines suffering from diarization errors and ASR hallucinations. To bridge this gap, we present a robust and scalable open-source data processing pipeline designed for full-duplex models.

Kyudan Jung, Jihwan Kim, Soyoon Kim, Jeonghoon Kim, Jaegul Choo, Cheonbok Park • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER 2.04 | 1156 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER 3.92 | 1151 |
| Automatic Speech Recognition | TED-LIUM3 (test) | WER 10.66 | 59 |
| Pause Handling | Full-Duplex-Bench Candor | TOR 1 | 13 |
| Full-duplex Speech Interaction Latency Analysis | Full-Duplex-Bench v1.5 | Stop Latency (Mean) 0.68 | 8 |
| Overlap Handling Evaluation | Full-Duplex-Bench Background Speech v1.5 | STOI 0.98 | 2 |
| Overlap Handling Evaluation | Full-Duplex-Bench Talking to Other v1.5 | STOI 0.96 | 2 |
| Overlap Handling Evaluation | Full-Duplex-Bench User Backchannel v1.5 | STOI 91 | 2 |
| Overlap Handling Evaluation | Full-Duplex-Bench User Interruption v1.5 | STOI 0.97 | 2 |
| Pause Handling | Synthetic | TOR 1 | 2 |
Showing 10 of 13 rows

Other info

GitHub