SongEval: A Benchmark Dataset for Song Aesthetics Evaluation

About

Aesthetics serve as an implicit but important criterion in song generation tasks, reflecting human perception beyond what objective metrics capture. However, evaluating the aesthetics of generated songs remains a fundamental challenge, because the appreciation of music is highly subjective. Existing evaluation metrics, such as embedding-based distances, are limited in how well they reflect the subjective and perceptual aspects that define musical appeal. To address this issue, we introduce SongEval, the first open-source, large-scale benchmark dataset for evaluating the aesthetics of full-length songs. SongEval includes over 2,399 full-length songs totaling more than 140 hours, with aesthetic ratings from 16 professional annotators with musical backgrounds. Each song is evaluated across five key dimensions: overall coherence, memorability, naturalness of vocal breathing and phrasing, clarity of song structure, and overall musicality. The dataset covers both English and Chinese songs and spans nine mainstream genres. Moreover, to assess the effectiveness of song aesthetic evaluation, we conduct experiments using SongEval to predict aesthetic scores and demonstrate better performance than existing objective evaluation metrics in predicting human-perceived musical quality.

Jixun Yao, Guobin Ma, Huixin Xue, Huakang Chen, Chunbo Hao, Yuepeng Jiang, Haohe Liu, Ruibin Yuan, Jin Xu, Wei Xue, Hao Liu, Lei Xie • 2025

Related benchmarks

Task	Dataset	Result
Audio Assessment Correlation	PAM	LCC 0.6987	45
Musicality Evaluation	MusicEval (test)	SRCC 0.6949	26
Musicality Evaluation	Music Arena	Accuracy 0.7388	15
Musicality Evaluation	CMI-Pref	Accuracy 0.724	15
Music Musicality Assessment	CMI-RewardBench (PAM)	SRCC 0.6977	11
Music Preference Prediction	CMI-RewardBench (Music Arena)	Pairwise Accuracy 73.88	11
Music Preference Prediction	RewardBench (CMI-Pref)	Pairwise Accuracy 72.4	11
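
The results above are reported as LCC (linear, i.e. Pearson, correlation coefficient), SRCC (Spearman rank correlation coefficient), and pairwise accuracy. As a rough illustration of what these metrics measure, here is a minimal standard-library sketch. The function names and all toy numbers are hypothetical and chosen only for illustration; this is not the authors' evaluation code, and none of the values come from SongEval.

```python
from math import sqrt

def pearson(x, y):
    """LCC: Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """SRCC: Pearson correlation computed on the ranks of the scores."""
    return pearson(average_ranks(x), average_ranks(y))

def pairwise_accuracy(pairs):
    """Fraction of (score_a, score_b, human_prefers_a) triples where the
    higher predicted score matches the human preference."""
    correct = sum((sa > sb) == prefers_a for sa, sb, prefers_a in pairs)
    return correct / len(pairs)

# Made-up toy data: human ratings vs. model-predicted scores for five songs.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.2, 1.9, 3.5, 3.4, 4.8]

# Made-up preference pairs: (model score for A, model score for B, human prefers A).
pairs = [(0.8, 0.3, True), (0.2, 0.6, False), (0.5, 0.7, True), (0.9, 0.1, True)]

print("LCC:", pearson(human, predicted))
print("SRCC:", spearman(human, predicted))
print("Pairwise accuracy:", pairwise_accuracy(pairs))
```

Note that the table lists accuracy both as a fraction (0.7388) and as a percentage (73.88); the sketch returns a fraction.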
