Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models
About
As the paradigm of AI shifts from text-based LLMs to Speech Language Models (SLMs), there is a growing demand for full-duplex systems capable of real-time, natural human-computer interaction. However, the development of such models is constrained by the scarcity of high-quality, multi-speaker conversational data, as existing large-scale resources are predominantly single-speaker or limited in volume. Addressing the complex dynamics of natural dialogue, such as overlapping and back-channeling, remains a challenge, with standard processing pipelines suffering from diarization errors and ASR hallucinations. To bridge this gap, we present a robust and scalable open-source data processing pipeline designed for full-duplex models.
Kyudan Jung, Jihwan Kim, Soyoon Kim, Jeonghoon Kim, Jaegul Choo, Cheonbok Park • 2026
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 2.04 | 1156 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 3.92 | 1151 |
| Automatic Speech Recognition | TED-LIUM3 (test) | WER | 10.66 | 59 |
| Pause Handling | Full-Duplex-Bench Candor | TOR | 1 | 13 |
| Full-duplex Speech Interaction Latency Analysis | Full-Duplex-Bench v1.5 | Stop Latency (Mean) | 0.68 | 8 |
| Overlap Handling Evaluation | Full-Duplex-Bench Background Speech v1.5 | STOI | 0.98 | 2 |
| Overlap Handling Evaluation | Full-Duplex-Bench Talking to Other v1.5 | STOI | 0.96 | 2 |
| Overlap Handling Evaluation | Full-Duplex-Bench User Backchannel v1.5 | STOI | 91 | 2 |
| Overlap Handling Evaluation | Full-Duplex-Bench User Interruption v1.5 | STOI | 0.97 | 2 |
| Pause Handling | Synthetic | TOR | 1 | 2 |
Showing 10 of 13 rows