CAF-Score: Calibrating CLAP with LALMs for Reference-free Audio Captioning Evaluation
About
While Large Audio-Language Models (LALMs) have advanced audio captioning, robust evaluation remains difficult. Reference-based metrics are expensive and often fail to assess acoustic fidelity, while Contrastive Language-Audio Pretraining (CLAP)-based approaches frequently overlook syntactic errors and fine-grained details. We propose CAF-Score, a reference-free metric that calibrates CLAP's coarse-grained semantic alignment with the fine-grained comprehension and syntactic awareness of LALMs. By combining contrastive audio-text embeddings with LALM reasoning, CAF-Score effectively detects syntactic inconsistencies and subtle hallucinations. Experiments on the BRACE benchmark demonstrate that our approach achieves the highest correlation with human judgments, even outperforming reference-based baselines in challenging scenarios. These results highlight the efficacy of CAF-Score for reference-free audio captioning evaluation. Code and results are available at https://github.com/inseong00/CAF-Score.
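To make the idea of calibrating a CLAP similarity with an LALM judgment concrete, here is a minimal sketch. It is not the authors' formulation, which the abstract does not specify. It assumes a simple convex combination of two scores. The names `cosine_similarity`, `calibrated_score`, the weight `alpha`, and the input `lalm_score` (an LALM-derived caption quality value in [0, 1]) are all hypothetical. The actual method is in the linked repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # undefined for zero vectors; treat as no alignment
    return dot / (norm_a * norm_b)


def calibrated_score(audio_emb, text_emb, lalm_score, alpha=0.5):
    """Blend a contrastive similarity with an LALM-derived score.

    ASSUMPTION: a convex combination with weight `alpha`; this is an
    illustrative stand-in, not the combination rule used by CAF-Score.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not 0.0 <= lalm_score <= 1.0:
        raise ValueError("lalm_score must lie in [0, 1]")
    # Map cosine similarity from [-1, 1] to [0, 1] so both terms share a scale.
    clap_score = (cosine_similarity(audio_emb, text_emb) + 1.0) / 2.0
    return alpha * clap_score + (1.0 - alpha) * lalm_score
```

In this sketch, a caption whose embedding aligns well with the audio but that the LALM flags as hallucinated (low `lalm_score`) is pulled down, which mirrors the calibration the abstract describes.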
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Hallucination Detection | BRACE Hallucination 1.0 (test) | AudioCaps Score | 97.96 | 46 |
| Text-to-Audio evaluation | RELATE (test) | LCC | 0.54 | 38 |
| Text-to-Audio evaluation | PAM (test) | LCC | 0.609 | 36 |
| Audio Captioning Evaluation | BRACE Main 1.0 | AudioCaps-Main HH Score | 67.63 | 26 |