Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models

About

As the paradigm of AI shifts from text-based LLMs to Speech Language Models (SLMs), there is a growing demand for full-duplex systems capable of real-time, natural human-computer interaction. However, the development of such models is constrained by the scarcity of high-quality, multi-speaker conversational data, as existing large-scale resources are predominantly single-speaker or limited in volume. Handling the complex dynamics of natural dialogue, such as overlapping speech and back-channeling, remains a challenge, and standard processing pipelines suffer from diarization errors and ASR hallucinations. To bridge this gap, we present a robust and scalable open-source data processing pipeline designed for full-duplex models.

Kyudan Jung, Jihwan Kim, Soyoon Kim, Jeonghoon Kim, Jaegul Choo, Cheonbok Park • 2026

Related benchmarks

Task	Dataset	Result
Automatic Speech Recognition	LibriSpeech (test-other)	WER 3.92	1447
Automatic Speech Recognition	LibriSpeech clean (test)	WER 2.04	1410
Automatic Speech Recognition	TED-LIUM3 (test)	WER 10.66	88
Pause Handling	Full-Duplex-Bench Candor	TOR 1	19
Full-duplex Speech Interaction Latency Analysis	Full-Duplex-Bench v1.5	Stop Latency (Mean) 0.68	8
Smooth Turn Taking	CANDOR	TOR 1	8
User Interruption	Full-Duplex-Bench 1.0	TOR 0.858	8
Backchannel	Full-Duplex-Bench 1.0	TOR 0.291	7
Overlap Handling Evaluation	Full-Duplex-Bench Background Speech v1.5	STOI 0.98	2
Overlap Handling Evaluation	Full-Duplex-Bench Talking to Other v1.5	STOI 0.96	2

Showing 10 of 13 rows
