
JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions

About

The rapid advancement of generative audio models has outpaced the development of robust evaluation methodologies. Existing objective metrics and general multimodal large language models (MLLMs) often struggle with domain generalization, zero-shot evaluation, and instructional flexibility. To address these bottlenecks, we propose JASTIN, a generalizable, instruction-driven audio evaluation framework that formulates audio assessment as a self-instructed reasoning task. JASTIN bridges a frozen high-performance audio encoder with a fine-tuned LLM backbone via a trainable audio adapter. To ensure robust zero-shot generalization, we introduce a comprehensive instruction-following data preparation pipeline that incorporates Multi-Source, Multi-Task, Multi-Calibration, and Multi-Description data. Experimental results demonstrate that JASTIN achieves state-of-the-art Pearson and Spearman correlations with human subjective ratings. It consistently outperforms general MLLMs across speech, sound, music, and out-of-domain evaluation tasks without task-specific retraining.

Leying Zhang, Bowen Shi, Haibin Wu, Bach Viet Do, Yanmin Qian • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Speech Quality Assessment | QualiSpeech | PCC - Noise | 0.668 | 14 |
| Speech Quality Assessment | SpeechEval | PCC - Overall | 0.662 | 14 |
| Musical Quality Assessment | M-Ovrl | PCC | 0.642 | 11 |
| ASMR Speech Quality Assessment | AsmrMOS | PCC | 0.297 | 11 |
| Music Textual Alignment | M-TA (Music Textual Alignment) | PCC | 0.487 | 11 |
| Synthesized Speech Quality Assessment | SynMOS | PCC | 0.496 | 11 |
| Audio Aesthetics Evaluation | Audiobox Aesthetics (AES) Music | PCC (CE) | 0.749 | 10 |
| Audio Aesthetics Evaluation | Audiobox Aesthetics (AES) Speech | PCC (CE) | 0.531 | 10 |
| Audio Aesthetics Evaluation | Audiobox Aesthetics (AES) Sound | PCC (CE) | 0.542 | 10 |
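
The PCC results listed here are presumably Pearson correlation coefficients between model-predicted scores and human ratings, and the abstract also reports Spearman correlations. The following is a minimal sketch of how both could be computed, using only the Python standard library. It is not the authors' evaluation code: the function names and the sample scores (`human_mos`, `model_scores`) are hypothetical and chosen purely for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: human mean opinion scores and model predictions.
human_mos = [1.0, 2.0, 3.0, 4.0, 5.0]
model_scores = [1.2, 1.9, 3.5, 3.4, 4.8]

print(f"PCC:  {pearson(human_mos, model_scores):.3f}")
print(f"SRCC: {spearman(human_mos, model_scores):.3f}")  # prints SRCC: 0.900
```

Pearson measures linear agreement with the human scores, while Spearman only checks that the ordering of items matches, so a model that ranks clips correctly but on a compressed scale can score well on Spearman and lower on Pearson.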