
Revealing the Truth with ConLLM for Detecting Multi-Modal Deepfakes

About

The rapid rise of deepfake technology poses a severe threat to social and political stability by enabling hyper-realistic synthetic media capable of manipulating public perception. However, existing detection methods struggle with two core limitations: (1) modality fragmentation, which leads to poor generalization across diverse and adversarial deepfake modalities; and (2) shallow inter-modal reasoning, resulting in limited detection of fine-grained semantic inconsistencies. To address these, we propose ConLLM (Contrastive Learning with Large Language Models), a hybrid framework for robust multimodal deepfake detection. ConLLM employs a two-stage architecture: stage 1 uses Pre-Trained Models (PTMs) to extract modality-specific embeddings; stage 2 aligns these embeddings via contrastive learning to mitigate modality fragmentation, and refines them using LLM-based reasoning to address shallow inter-modal reasoning by capturing semantic inconsistencies. ConLLM demonstrates strong performance across audio, video, and audio-visual modalities. It reduces audio deepfake EER by up to 50%, improves video accuracy by up to 8%, and achieves approximately 9% accuracy gains in audio-visual tasks. Ablation studies confirm that PTM-based embeddings contribute 9%-10% consistent improvements across modalities.
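The second stage's contrastive alignment of modality-specific embeddings can be sketched with a symmetric InfoNCE-style objective, where matching audio/visual pairs are pulled together and other pairs in the batch act as negatives. This is an illustrative assumption, not the paper's released code; the function names and the choice of InfoNCE are ours.

```python
import numpy as np

def _log_softmax(x):
    # Numerically stable row-wise log-softmax.
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def info_nce(audio_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of each matrix is assumed to come from the same clip (a positive
    pair); every other row in the batch serves as a negative.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature            # (B, B) cosine similarities
    loss_av = -np.diag(_log_softmax(logits)).mean()    # audio -> visual
    loss_va = -np.diag(_log_softmax(logits.T)).mean()  # visual -> audio
    return (loss_av + loss_va) / 2

# Toy batch: 8 paired clips, 16-dim embeddings from hypothetical PTM encoders.
rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 16))
visual = audio + 0.1 * rng.normal(size=(8, 16))  # roughly aligned pairs
loss = info_nce(audio, visual)
```

Minimizing this loss pushes embeddings of the same clip into a shared space, which is one standard way to counter the modality fragmentation the abstract describes.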

Gautam Siddharth Kashyap, Harsh Joshi, Niharika Jain, Ebad Shabbir, Jiechao Gao, Nipun Joshi, Usman Naseem • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Audio Deepfake Detection | ASVspoof 2019 | EER | 0.21 | 25 |
| Video Deepfake Detection | Celeb-DF (CDF) | -- | -- | 21 |
| Audio-Visual Deepfake Detection | FakeAVCeleb | Accuracy | 98.75 | 11 |
| Audio-Visual Deepfake Detection | DeepFake Detection Challenge (DFDC) | Accuracy | 96.5 | 11 |
| Video Deepfake Detection | WildDeepfake (WD) | Accuracy | 85 | 8 |
| Audio Deepfake Detection | DE-CRO | EER | 0.01 | 6 |
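The audio rows report equal error rate (EER), the operating point where the false-acceptance and false-rejection rates are equal (lower is better). A minimal sketch of computing EER from detector scores follows; the convention that higher scores mean "bona fide" is an assumption for illustration.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: the error rate at the threshold where FAR == FRR.

    scores: detector outputs, higher = more likely bona fide.
    labels: 1 for bona fide, 0 for spoof/deepfake.
    Sweeps all observed scores as thresholds and returns the mean of
    FAR and FRR at the point where they are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.sort(np.unique(scores))
    far = np.array([np.mean(scores[labels == 0] >= t) for t in thresholds])
    frr = np.array([np.mean(scores[labels == 1] < t) for t in thresholds])
    i = np.argmin(np.abs(far - frr))  # crossover point
    return (far[i] + frr[i]) / 2
```

A perfectly separating detector yields an EER of 0; chance-level scoring yields roughly 0.5.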
