# Improving Perceptual Audio Aesthetic Assessment via Triplet Loss and Self-Supervised Embeddings

## About
We present a system for automatic multi-axis perceptual quality prediction of generative audio, developed for Track 2 of the AudioMOS Challenge 2025. The task is to predict four Audio Aesthetic Scores (Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness) for audio generated by text-to-speech (TTS), text-to-audio (TTA), and text-to-music (TTM) systems. A central challenge is the domain shift between natural training data and synthetic evaluation data. To address this, we combine BEATs, a pretrained transformer-based audio representation model, with a multi-branch long short-term memory (LSTM) predictor, and apply a triplet loss with buffer-based sampling to structure the embedding space by perceptual similarity. Our results show that this training strategy improves embedding discriminability and generalization, enabling domain-robust audio quality assessment without synthetic training data.
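To make the recipe concrete, below is a minimal PyTorch sketch of the two components named above: a multi-branch LSTM predictor over frame-level embeddings from a frozen BEATs encoder, and a buffer-based triplet sampler that mines positives and negatives by score proximity. Class names, hidden sizes, thresholds, the 768-dim embedding size, and the 1-10 label range are illustrative assumptions, not the authors' exact configuration.

```python
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

AXES = ["PQ", "PC", "CE", "CU"]  # the four Audio Aesthetic Score axes


class MultiBranchLSTMPredictor(nn.Module):
    """One BiLSTM branch + linear head per axis over frame-level BEATs embeddings."""

    def __init__(self, embed_dim: int = 768, hidden: int = 128):
        super().__init__()
        self.branches = nn.ModuleDict(
            {a: nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True) for a in AXES}
        )
        self.heads = nn.ModuleDict({a: nn.Linear(2 * hidden, 1) for a in AXES})

    def forward(self, frames: torch.Tensor):
        # frames: (batch, time, embed_dim) embeddings from a frozen BEATs encoder
        scores, feats = {}, {}
        for a in AXES:
            seq, _ = self.branches[a](frames)
            feats[a] = seq.mean(dim=1)                  # temporal average pooling
            scores[a] = self.heads[a](feats[a]).squeeze(-1)
        return scores, feats


class TripletBuffer:
    """FIFO buffer of (embedding, label) pairs for cross-batch triplet mining.

    A positive shares a similar perceptual score with the anchor; a negative
    differs by a wide margin. Thresholds here are illustrative assumptions.
    """

    def __init__(self, capacity: int = 2048, pos_thresh: float = 0.5, neg_thresh: float = 2.0):
        self.buf = deque(maxlen=capacity)
        self.pos_thresh, self.neg_thresh = pos_thresh, neg_thresh

    def add(self, emb: torch.Tensor, label: float):
        self.buf.append((emb.detach(), label))          # stored embeddings carry no gradient

    def sample(self, anchor_label: float):
        pos = [e for e, y in self.buf if abs(y - anchor_label) <= self.pos_thresh]
        neg = [e for e, y in self.buf if abs(y - anchor_label) >= self.neg_thresh]
        if not pos or not neg:
            return None
        return random.choice(pos), random.choice(neg)


# Usage sketch: joint regression + triplet objective (triplet shown on the PQ axis only).
model = MultiBranchLSTMPredictor()
buffer = TripletBuffer()
triplet = nn.TripletMarginLoss(margin=0.3)

frames = torch.randn(8, 250, 768)                       # stand-in for BEATs frame embeddings
labels = {a: torch.rand(8) * 9 + 1 for a in AXES}       # stand-in AES labels in [1, 10]

scores, feats = model(frames)
loss = sum(F.mse_loss(scores[a], labels[a]) for a in AXES)
for emb, y in zip(feats["PQ"], labels["PQ"]):
    pair = buffer.sample(float(y))
    if pair is not None:                                # skip until the buffer has candidates
        pos, neg = pair
        loss = loss + triplet(emb.unsqueeze(0), pos.unsqueeze(0), neg.unsqueeze(0))
    buffer.add(emb, float(y))
loss.backward()
```

The buffer decouples triplet mining from the current mini-batch: anchors can be paired with embeddings from earlier batches, so usable triplets remain available even when a single batch contains no sufficiently close or distant scores.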
## Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Audio Content Enjoyment (CE) Assessment | AES-Natural | SRCC 0.904 | 9 |
| Audio Content Usefulness (CU) Assessment | AES-Natural | SRCC 0.894 | 9 |
| Audio Production Quality (PQ) Assessment | AES-Natural | SRCC 0.896 | 9 |
| Audio Production Complexity (PC) Assessment | AES-Natural | SRCC 0.928 | 9 |