
PAM: Prompting Audio-Language Models for Audio Quality Assessment

About

While audio quality is a key performance metric for various audio processing tasks, including generative modeling, its objective measurement remains a challenge. Audio-Language Models (ALMs) are pre-trained on audio-text pairs that may contain information about audio quality, the presence of artifacts, or noise. Given an audio input and a text prompt related to quality, an ALM can be used to calculate a similarity score between the two. Here, we exploit this capability and introduce PAM, a no-reference metric for assessing audio quality for different audio processing tasks. Contrary to other "reference-free" metrics, PAM does not require computing embeddings on a reference dataset nor training a task-specific model on a costly set of human listening scores. We extensively evaluate the reliability of PAM against established metrics and human listening scores on four tasks: text-to-audio (TTA), text-to-music generation (TTM), text-to-speech (TTS), and deep noise suppression (DNS). We perform multiple ablation studies with controlled distortions, in-the-wild setups, and prompt choices. Our evaluation shows that PAM correlates well with existing metrics and human listening scores. These results demonstrate the potential of ALMs for computing a general-purpose audio quality metric.
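The similarity-based scoring described in the abstract can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the embedding functions below are random-projection stand-ins for a pretrained ALM's (e.g., CLAP-style) audio and text encoders, and the two antonym prompts are illustrative choices rather than the paper's exact wording. The core idea it demonstrates is computing audio-text similarity against a "good quality" and a "bad quality" prompt and taking a softmax over the two.

```python
import numpy as np

DIM = 8  # embedding dimension (illustrative)
rng = np.random.default_rng(0)
_proj = rng.standard_normal((16, DIM))  # shared fake projection matrix

def embed_audio(audio: np.ndarray) -> np.ndarray:
    # Stand-in for an ALM audio encoder: project a few raw samples
    # to a fixed-size embedding. A real system would use a pretrained model.
    return audio[:16] @ _proj

def embed_text(prompt: str) -> np.ndarray:
    # Stand-in for an ALM text encoder: hash characters into a vector,
    # then project into the shared embedding space.
    v = np.zeros(16)
    for i, ch in enumerate(prompt.encode()):
        v[i % 16] += ch
    return v @ _proj

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pam_score(audio: np.ndarray,
              good: str = "the sound is clear and clean",
              bad: str = "the sound is noisy and with artifacts") -> float:
    # Score audio against a "good" and a "bad" quality prompt,
    # then softmax over the two similarities; return the probability
    # mass assigned to the "good quality" prompt.
    za = embed_audio(audio)
    sims = np.array([cosine(za, embed_text(good)),
                     cosine(za, embed_text(bad))])
    probs = np.exp(sims) / np.exp(sims).sum()
    return float(probs[0])

score = pam_score(rng.standard_normal(16000))
```

With real audio and text encoders swapped in, `score` would be a no-reference quality estimate in [0, 1]: no reference recordings, embedding databases, or listening-test training labels are needed at evaluation time.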

Soham Deshmukh, Dareen Alharthi, Benjamin Elizalde, Hannes Gamper, Mahmoud Al Ismail, Rita Singh, Bhiksha Raj, Huaming Wang · 2024

Related benchmarks

Task                          Dataset           Metric    Result   Rank
Audio Assessment Correlation  PAM               LCC       0.5873   38
Musicality Evaluation         MusicEval (test)  LCC       0.6466   15
Musicality Evaluation         CMI-Pref          Accuracy  0.654    15
Musicality Evaluation         Music Arena       Accuracy  0.6313   15
