SAM Audio: Segment Anything in Audio
About
General audio source separation is a key capability for multimodal AI systems that can perceive and reason about sound. Despite substantial progress in recent years, existing separation models are either domain-specific, designed for fixed categories such as speech or music, or limited in controllability, supporting only a single prompting modality such as text. In this work, we present SAM Audio, a foundation model for general audio separation that unifies text, visual, and temporal span prompting within a single framework. Built on a diffusion transformer architecture, SAM Audio is trained with flow matching on large-scale audio data spanning speech, music, and general sounds, and can flexibly separate target sources described by language, visual masks, or temporal spans. The model achieves state-of-the-art performance across a diverse suite of benchmarks, including general sound, speech, music, and musical instrument separation in both in-the-wild and professionally produced audio, substantially outperforming prior general-purpose and specialized systems. Furthermore, we introduce a new real-world separation benchmark with human-labeled multimodal prompts and a reference-free evaluation model that correlates strongly with human judgment.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-prompted separation | Instr pro | SAJ | 4.82 | 11 |
| Text-prompted separation | Speech | SAJ | 4.67 | 9 |
| Text-prompted separation | Instr(wild) | SAJ | 4.32 | 9 |
| Audio separation quality assessment | SAM Audio-Bench Speech | PCC Overall | 0.363 | 9 |
| Audio separation quality assessment | SAM Audio-Bench Music | PCC Overall | 0.228 | 9 |
| Audio separation quality assessment | SAM Audio-Bench Sound | PCC Overall | 0.187 | 9 |
| Text-prompted separation | Speaker | SAJ | 4.51 | 9 |
| Text-prompted separation | music | SAJ | 4.45 | 7 |
| Text-prompted separation | General SFX | SAJ Score | 4.35 | 5 |
| Visual-prompted audio separation | Speaker | IB Score | 0.24 | 5 |
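
The quality-assessment rows report a PCC value. Assuming PCC here means the Pearson correlation coefficient between an automatic quality score and human ratings (the abstract only says the evaluation model "correlates strongly with human judgment"), the minimal sketch below shows how such a number could be computed. The function name `pearson_corr` and the sample scores are hypothetical and for illustration only; they are not taken from the paper or its benchmark.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up example: automatic scores vs. human ratings for five separated clips.
model_scores = [3.9, 4.4, 2.1, 4.8, 3.0]
human_ratings = [4.0, 4.5, 2.5, 4.6, 3.4]
print(f"PCC = {pearson_corr(model_scores, human_ratings):.3f}")
```

A value near 1 means the automatic score ranks and spaces clips much like human listeners do, while a value near 0 means there is little linear agreement.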