Audio Language Model for Deepfake Detection Grounded in Acoustic Chain-of-Thought
About
Deepfake speech detection systems are often limited to binary classification and struggle to generate interpretable reasoning or context-rich explanations for their decisions. These models primarily extract latent embeddings for authenticity detection but fail to leverage structured acoustic evidence such as prosodic, spectral, and physiological attributes in a meaningful way. This paper introduces CoLMbo-DF, a Feature-Guided Audio Language Model that addresses these limitations by integrating robust deepfake detection with explicit acoustic chain-of-thought reasoning. By injecting structured textual representations of low-level acoustic features directly into the model prompt, our approach grounds the model's reasoning in interpretable evidence and improves detection accuracy. To support this framework, we introduce a novel dataset of audio samples paired with chain-of-thought annotations. Experiments show that our method, built on a lightweight open-source language model, significantly outperforms existing audio language model baselines despite its smaller scale, marking a notable advance in explainable deepfake speech detection.
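The core idea of feature-guided prompting can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the feature names, values, and prompt wording below are hypothetical, chosen only to show how low-level acoustic measurements might be serialized into a textual evidence block for the language model.

```python
# Hypothetical sketch of acoustic feature-guided prompt construction.
# Feature names, values, and wording are illustrative assumptions,
# not the exact format used by CoLMbo-DF.

def build_feature_prompt(features: dict) -> str:
    """Serialize low-level acoustic features into a textual evidence block
    that can be prepended to the model prompt."""
    evidence = "\n".join(f"- {name}: {value}" for name, value in features.items())
    return (
        "Acoustic evidence:\n"
        f"{evidence}\n"
        "Reason step by step over the evidence above, then decide whether "
        "the audio is bona fide or a deepfake."
    )

# Example features spanning the prosodic, physiological, and spectral
# attribute families mentioned in the abstract (values are made up).
example_features = {
    "mean F0 (Hz)": 118.4,           # prosodic
    "jitter (%)": 0.62,              # physiological / voice quality
    "spectral centroid (Hz)": 2450,  # spectral
}

print(build_feature_prompt(example_features))
```

The resulting text block would then be concatenated with the task instruction and fed to the audio language model, so that its chain-of-thought reasoning can cite the listed measurements as explicit evidence.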
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Deepfake Detection | ASVSpoof | Accuracy (ASV) | 98.7 | 9 |
| Audio Deepfake Detection | CosyFish | Accuracy (ADD) | 95.1 | 9 |
| Automatic Speaker Verification | ASVSpoof | Accuracy (ASV) | 75.1 | 9 |
| Automatic Speaker Verification | CosyFish | Accuracy (ASV) | 62.5 | 9 |