Audio Language Model for Deepfake Detection Grounded in Acoustic Chain-of-Thought
About
Deepfake speech detection systems are often limited to binary classification and struggle to generate interpretable reasoning or context-rich explanations for their decisions. These models primarily extract latent embeddings for authenticity detection but fail to leverage structured acoustic evidence such as prosodic, spectral, and physiological attributes in a meaningful way. This paper introduces CoLMbo-DF, a Feature-Guided Audio Language Model that addresses these limitations by integrating robust deepfake detection with explicit acoustic chain-of-thought reasoning. By injecting structured textual representations of low-level acoustic features directly into the model prompt, our approach grounds the model's reasoning in interpretable evidence and improves detection accuracy. To support this framework, we introduce a novel dataset of audio samples paired with chain-of-thought annotations. Experiments show that our method, built on a lightweight open-source language model, significantly outperforms existing audio language model baselines despite its smaller scale, marking a notable advance in explainable deepfake speech detection.
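The core idea of feature-guided prompting can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the feature names, values, and prompt wording below are hypothetical, chosen only to show how low-level acoustic measurements might be serialized into a textual evidence block for the language model.

```python
# Hypothetical sketch of acoustic feature-guided prompt construction.
# Feature names, values, and wording are illustrative assumptions,
# not the exact format used by CoLMbo-DF.

def build_feature_prompt(features: dict) -> str:
    """Serialize low-level acoustic features into a textual evidence block
    that can be prepended to the model prompt."""
    evidence = "\n".join(f"- {name}: {value}" for name, value in features.items())
    return (
        "Acoustic evidence:\n"
        f"{evidence}\n"
        "Reason step by step over the evidence above, then decide whether "
        "the audio is bona fide or a deepfake."
    )

# Example features spanning the prosodic, physiological, and spectral
# attribute families mentioned in the abstract (values are made up).
example_features = {
    "mean F0 (Hz)": 118.4,           # prosodic
    "jitter (%)": 0.62,              # physiological / voice quality
    "spectral centroid (Hz)": 2450,  # spectral
}

print(build_feature_prompt(example_features))
```

The resulting text block would then be concatenated with the task instruction and fed to the audio language model, so that its chain-of-thought reasoning can cite the listed measurements as explicit evidence.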
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Deepfake Detection | ASVSpoof | Accuracy (ASV) | 98.7 | 9 |
| Audio Deepfake Detection | CosyFish | Accuracy (ADD) | 95.1 | 9 |
| Automatic Speaker Verification | ASVSpoof | Accuracy (ASV) | 75.1 | 9 |
| Automatic Speaker Verification | CosyFish | Accuracy (ASV) | 62.5 | 9 |