Audio Language Model for Deepfake Detection Grounded in Acoustic Chain-of-Thought

About

Deepfake speech detection systems are often limited to binary classification and struggle to generate interpretable reasoning or context-rich explanations for their decisions. These models primarily extract latent embeddings for authenticity detection but fail to make meaningful use of structured acoustic evidence such as prosodic, spectral, and physiological attributes. This paper introduces CoLMbo-DF, a Feature-Guided Audio Language Model that addresses these limitations by integrating robust deepfake detection with explicit acoustic chain-of-thought reasoning. By injecting structured textual representations of low-level acoustic features directly into the model prompt, our approach grounds the model's reasoning in interpretable evidence and improves detection accuracy. To support this framework, we introduce a novel dataset of audio pairs with chain-of-thought annotations. Experiments show that our method, trained on a lightweight open-source language model, significantly outperforms existing audio language model baselines despite its smaller scale, advancing explainable deepfake speech detection.
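The prompt-injection idea in the abstract can be illustrated with a minimal sketch. It assumes a handful of acoustic measurements have already been computed elsewhere; the feature names, the example values, the prompt wording, and the helpers `AcousticFeatures`, `features_to_text`, and `build_prompt` are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class AcousticFeatures:
    """Hypothetical container; the field names are illustrative only."""
    mean_f0_hz: float            # prosodic: average fundamental frequency
    f0_std_hz: float             # prosodic: pitch variability
    spectral_centroid_hz: float  # spectral: rough "brightness" of the signal
    jitter_percent: float        # physiological: cycle-to-cycle pitch perturbation
    shimmer_percent: float       # physiological: cycle-to-cycle amplitude perturbation


def features_to_text(features: AcousticFeatures) -> str:
    """Serialize the measurements as a plain-text bullet list."""
    lines = [
        f"- mean F0: {features.mean_f0_hz:.1f} Hz",
        f"- F0 standard deviation: {features.f0_std_hz:.1f} Hz",
        f"- spectral centroid: {features.spectral_centroid_hz:.1f} Hz",
        f"- jitter: {features.jitter_percent:.2f} %",
        f"- shimmer: {features.shimmer_percent:.2f} %",
    ]
    return "\n".join(lines)


def build_prompt(features: AcousticFeatures) -> str:
    """Place the textual evidence ahead of the question (assumed template)."""
    return (
        "Acoustic evidence:\n"
        f"{features_to_text(features)}\n"
        "Question: Is this speech real or fake? "
        "Reason step by step from the evidence above, then give a verdict."
    )


if __name__ == "__main__":
    # Made-up values purely for demonstration.
    example = AcousticFeatures(182.4, 21.7, 2310.0, 0.45, 3.10)
    print(build_prompt(example))
```

Keeping the evidence as short labeled lines means the model's reasoning can cite individual measurements by name, which is the sense in which the reasoning is grounded; the actual feature set and prompt format used by CoLMbo-DF may differ.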

Runkun Chen, Yixiong Fang, Pengyu Chang, Yuante Li, Massa Baali, Bhiksha Raj • 2026

Related benchmarks

Task                            Dataset    Result                Rank
Audio Deepfake Detection        ASVSpoof   Accuracy (ASV): 98.7  9
Audio Deepfake Detection        CosyFish   Accuracy (add): 95.1  9
Automatic Speaker Verification  ASVSpoof   Accuracy (ASV): 75.1  9
Automatic Speaker Verification  CosyFish   Accuracy (ASV): 62.5  9