
SAM Audio: Segment Anything in Audio

About

General audio source separation is a key capability for multimodal AI systems that can perceive and reason about sound. Despite substantial progress in recent years, existing separation models are either domain-specific, designed for fixed categories such as speech or music, or limited in controllability, supporting only a single prompting modality such as text. In this work, we present SAM Audio, a foundation model for general audio separation that unifies text, visual, and temporal span prompting within a single framework. Built on a diffusion transformer architecture, SAM Audio is trained with flow matching on large-scale audio data spanning speech, music, and general sounds, and can flexibly separate target sources described by language, visual masks, or temporal spans. The model achieves state-of-the-art performance across a diverse suite of benchmarks, including general sound, speech, music, and musical instrument separation in both in-the-wild and professionally produced audio, substantially outperforming prior general-purpose and specialized systems. Furthermore, we introduce a new real-world separation benchmark with human-labeled multimodal prompts and a reference-free evaluation model that correlates strongly with human judgment.

Bowen Shi, Andros Tjandra, John Hoffman, Helin Wang, Yi-Chiao Wu, Luya Gao, Julius Richter, Matt Le, Apoorv Vyas, Sanyuan Chen, Christoph Feichtenhofer, Piotr Dollár, Wei-Ning Hsu, Ann Lee • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Text-prompted separation | Instr pro | SAJ 4.82 | 11 |
| Text-prompted separation | Speech | SAJ 4.67 | 9 |
| Text-prompted separation | Instr(wild) | SAJ 4.32 | 9 |
| Audio separation quality assessment | SAM Audio-Bench Speech | PCC Overall 0.363 | 9 |
| Audio separation quality assessment | SAM Audio-Bench Music | PCC Overall 0.228 | 9 |
| Audio separation quality assessment | SAM Audio-Bench Sound | PCC Overall 0.187 | 9 |
| Text-prompted separation | Speaker | SAJ 4.51 | 9 |
| Text-prompted separation | music | SAJ 4.45 | 7 |
| Text-prompted separation | General SFX | SAJ Score 4.35 | 5 |
| Visual-prompted audio separation | Speaker | IB Score 0.24 | 5 |
Showing 10 of 16 rows
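
The "PCC Overall" results above are not defined on this page. Assuming PCC stands for the Pearson correlation coefficient between an evaluation model's scores and human ratings (in line with the abstract's claim that the evaluation model "correlates strongly with human judgment"), a minimal sketch of how such a number could be computed is shown below. The function name and the sample scores are hypothetical and are not taken from the paper or its benchmark.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: automatic quality scores vs. human ratings for five clips.
model_scores = [4.1, 3.2, 4.8, 2.5, 3.9]
human_ratings = [4.0, 3.0, 4.5, 2.0, 4.2]

pcc = pearson_correlation(model_scores, human_ratings)
print(f"PCC = {pcc:.3f}")
```

A value near 1 means the automatic scores rank and scale clips much as human listeners do; a value near 0 means there is little linear agreement.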
