
QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems

About

Evaluating audio generation systems, including text-to-music (TTM), text-to-speech (TTS), and text-to-audio (TTA), remains challenging due to the subjective and multi-dimensional nature of human perception. Existing methods treat mean opinion score (MOS) prediction as a regression problem, but standard regression losses overlook the relativity of perceptual judgments. To address this limitation, we introduce QAMRO, a novel Quality-aware Adaptive Margin Ranking Optimization framework that seamlessly integrates regression objectives from different perspectives, aiming to highlight perceptual differences and prioritize accurate ratings. Our framework leverages pre-trained audio-text models such as CLAP and Audiobox-Aesthetics, and is trained exclusively on the official AudioMOS Challenge 2025 dataset. It demonstrates superior alignment with human evaluations across all dimensions, significantly outperforming robust baseline models.
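The abstract contrasts plain MOS regression with an objective that also respects the relative ordering of clips. The following is a minimal sketch of that general idea, not the authors' formulation: it pairs a mean-squared-error term with a pairwise hinge ranking term whose margin grows with the gap between the two ground-truth MOS values. The function names, the `scale` and `weight` parameters, and the choice of a margin proportional to the MOS gap are all assumptions made for illustration.

```python
from itertools import combinations


def mse_loss(pred, target):
    """Plain regression term: mean squared error between predictions and MOS."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def adaptive_margin_ranking_loss(pred, target, scale=1.0):
    """Pairwise hinge loss with a margin proportional to the MOS gap.

    Assumption for illustration: margin = scale * |target_i - target_j|.
    Pairs with identical MOS carry no ordering information and are skipped.
    """
    total, count = 0.0, 0
    for i, j in combinations(range(len(pred)), 2):
        gap = target[i] - target[j]
        if gap == 0:
            continue
        sign = 1.0 if gap > 0 else -1.0
        margin = scale * abs(gap)
        # Penalize the pair unless the prediction gap has the right sign
        # and is at least as large as the margin.
        total += max(0.0, margin - sign * (pred[i] - pred[j]))
        count += 1
    return total / count if count else 0.0


def combined_loss(pred, target, weight=0.5, scale=1.0):
    """Hypothetical combination: regression term plus weighted ranking term."""
    return mse_loss(pred, target) + weight * adaptive_margin_ranking_loss(
        pred, target, scale
    )
```

With this sketch, predictions that match the ground truth exactly incur zero loss, while predictions that invert the ordering are penalized by both terms; a larger MOS gap between two clips demands a larger separation between their predicted scores.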

Chien-Chun Wang, Kuan-Tang Huang, Cheng-Yeh Yang, Hung-Shin Lee, Hsin-Min Wang, Berlin Chen • 2025

Related benchmarks

Task                                           Dataset       Result (SRCC)   Rank
Audio Production Complexity (PC) Assessment    AES-Natural   0.942           9
Audio Content Enjoyment (CE) Assessment        AES-Natural   0.869           9
Audio Content Usefulness (CU) Assessment       AES-Natural   0.852           9
Audio Production Quality (PQ) Assessment       AES-Natural   0.883           9
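The results above are reported as SRCC (Spearman rank correlation coefficient), which measures how well predicted scores preserve the ordering of the human ratings. As a rough sketch of how such a value can be computed, the snippet below ranks both score lists (averaging ranks for ties) and takes the Pearson correlation of the ranks. The helper names and the example numbers are illustrative only and are not taken from the benchmark.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def srcc(pred, target):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(pred), average_ranks(target))


# Illustrative values only: predicted scores vs. human ratings for four clips.
predicted = [0.10, 0.40, 0.35, 0.80]
human = [1.0, 3.0, 2.0, 4.0]
score = srcc(predicted, human)
```

Because SRCC depends only on ranks, the predicted scores do not need to be on the same scale as the human ratings; only their ordering matters.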
