# Revealing the Truth with ConLLM for Detecting Multi-Modal Deepfakes

## About
The rapid rise of deepfake technology poses a severe threat to social and political stability by enabling hyper-realistic synthetic media capable of manipulating public perception. However, existing detection methods struggle with two core limitations: (1) modality fragmentation, which leads to poor generalization across diverse and adversarial deepfake modalities; and (2) shallow inter-modal reasoning, resulting in limited detection of fine-grained semantic inconsistencies. To address these, we propose ConLLM (Contrastive Learning with Large Language Models), a hybrid framework for robust multimodal deepfake detection. ConLLM employs a two-stage architecture: stage 1 uses Pre-Trained Models (PTMs) to extract modality-specific embeddings; stage 2 aligns these embeddings via contrastive learning to mitigate modality fragmentation, and refines them using LLM-based reasoning to address shallow inter-modal reasoning by capturing semantic inconsistencies. ConLLM demonstrates strong performance across audio, video, and audio-visual modalities. It reduces audio deepfake EER by up to 50%, improves video accuracy by up to 8%, and achieves approximately 9% accuracy gains in audio-visual tasks. Ablation studies confirm that PTM-based embeddings contribute 9%-10% consistent improvements across modalities.
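The core of stage 2 is aligning the modality-specific PTM embeddings with a contrastive objective. As a rough illustration only (the paper's exact loss and hyperparameters are not given here), the sketch below implements a standard InfoNCE-style contrastive loss over paired audio and video embeddings in NumPy; the function name, temperature value, and batch layout are illustrative assumptions, not ConLLM's actual implementation.

```python
import numpy as np

def info_nce_loss(audio_emb, video_emb, temperature=0.07):
    """Illustrative InfoNCE-style loss: matched audio/video pairs
    (same row index) are positives, all other pairs are negatives."""
    # L2-normalize each embedding so similarity is cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    # Pairwise similarity matrix, scaled by temperature
    logits = a @ v.T / temperature
    # Numerically stable log-softmax over each row
    m = logits.max(axis=1, keepdims=True)
    log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    # Positives sit on the diagonal (i-th audio matches i-th video)
    idx = np.arange(len(a))
    return -log_probs[idx, idx].mean()
```

Minimizing this loss pulls embeddings of the same clip's audio and video together while pushing apart mismatched pairs, which is one standard way to counter the modality fragmentation the abstract describes.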
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Deepfake Detection | ASVspoof 2019 | EER | 0.21 | 25 |
| Video Deepfake Detection | Celeb-DF (CDF) | -- | -- | 21 |
| Audio-Visual Deepfake Detection | FakeAVCeleb | Accuracy | 98.75 | 11 |
| Audio-Visual Deepfake Detection | DeepFake Detection Challenge (DFDC) | Accuracy | 96.5 | 11 |
| Video Deepfake Detection | WildDeepfake (WD) | Accuracy | 85 | 8 |
| Audio Deepfake Detection | DE-CRO | EER | 0.01 | 6 |
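The audio benchmarks above report Equal Error Rate (EER), the operating point where the false-acceptance rate equals the false-rejection rate. A minimal sketch of how EER can be computed from detector scores (the function name and the convention that higher scores mean "bonafide" are assumptions for illustration):

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Find the threshold where false-acceptance rate (spoof passed)
    equals false-rejection rate (bonafide rejected), and return the
    error rate at that crossing point."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    # FAR: fraction of spoof scores at or above the threshold
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    # FRR: fraction of bonafide scores below the threshold
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    # EER is taken at the threshold where FAR and FRR are closest
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2
```

For example, a detector whose bonafide and spoof score distributions are perfectly separated yields an EER of 0, while heavily overlapping distributions push the EER toward 0.5.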