Emotion and Acoustics Should Agree: Cross-Level Inconsistency Analysis for Audio Deepfake Detection
About
Audio Deepfake Detection (ADD) aims to distinguish spoofed speech from bonafide speech. Most prior studies assume that stronger correlations within or across acoustic and emotional features imply authenticity, and thus focus on enhancing or measuring such correlations. However, existing methods often treat acoustic and emotional features in isolation or rely on correlation metrics, which overlook subtle desynchronization between them and smooth out abrupt discontinuities. To address these issues, we propose EAI-ADD, which treats cross-level emotion-acoustic inconsistency as the primary detection signal. We first project emotional and acoustic representations into a comparable space. Then we progressively integrate frame-level and utterance-level emotion features with acoustic features to capture cross-level emotion-acoustic inconsistencies across different temporal granularities. Experimental results on the ASVspoof 2019 LA and 2021 LA datasets demonstrate that the proposed EAI-ADD outperforms baselines, providing a more effective solution for audio anti-spoofing detection.
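As an illustration of the cross-level idea (not the paper's actual architecture), the sketch below projects emotion and acoustic features into a shared space, then scores inconsistency at both the frame level and the utterance level. All names, dimensions, and the cosine-distance choice are assumptions for demonstration; the real model learns these projections and the fusion end to end.

```python
import numpy as np

rng = np.random.default_rng(0)

def project(x, W):
    """Linearly project features into a shared comparison space (assumed linear for the sketch)."""
    return x @ W

def frame_inconsistency(emo, aco):
    """Per-frame cosine distance between projected emotion and acoustic features."""
    num = np.sum(emo * aco, axis=1)
    den = np.linalg.norm(emo, axis=1) * np.linalg.norm(aco, axis=1) + 1e-8
    return 1.0 - num / den  # shape (T,): high values flag desynchronized frames

def utterance_inconsistency(emo, aco):
    """Cosine distance between mean-pooled (utterance-level) representations."""
    e, a = emo.mean(axis=0), aco.mean(axis=0)
    return 1.0 - float(e @ a) / (np.linalg.norm(e) * np.linalg.norm(a) + 1e-8)

# Toy dimensions: T frames, emotion dim 64, acoustic dim 128, shared dim 32 (all hypothetical).
T, d_e, d_a, d_s = 50, 64, 128, 32
emo_feats = rng.standard_normal((T, d_e))   # stand-in for emotion encoder outputs
aco_feats = rng.standard_normal((T, d_a))   # stand-in for acoustic encoder outputs
W_e = rng.standard_normal((d_e, d_s)) * 0.1
W_a = rng.standard_normal((d_a, d_s)) * 0.1

emo_p, aco_p = project(emo_feats, W_e), project(aco_feats, W_a)
frame_scores = frame_inconsistency(emo_p, aco_p)     # frame-level inconsistency
utt_score = utterance_inconsistency(emo_p, aco_p)    # utterance-level inconsistency
score = 0.5 * frame_scores.mean() + 0.5 * utt_score  # combined cross-level score
```

A detector would threshold (or learn a classifier on) such a combined score, with larger inconsistency suggesting spoofed speech.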
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Spoof Speech Detection | ASVspoof LA 2021 (eval) | min t-DCF 0.2533 | 36 |
| Anti-spoofing | ASVspoof LA 2019 (eval) | t-DCF 0.011 | 7 |