Towards multi-modal forgery representation learning for AI-generated video detection and localization

About

Recent advances in generative AI have democratized video creation at scale. AI-generated videos, including clips that are only partially manipulated across visual and audio channels, pose escalating risks of semantic distortion and misuse, which motivates the need for reliable detection tools. Most existing AI-generated video detectors model only a single modality or a subset of modalities, and they lack fine-grained temporal forgery localization. To address these challenges, we introduce a core architecture that jointly integrates a large multimodal model (LMM) semantic branch with a spatio-temporal (ST) visual branch and a multi-scale partial-spoof (PS) audio branch. This multi-modal approach enables simultaneous detection and fine-grained temporal localization of partially manipulated AI-generated video forgeries. Extensive experiments show that this approach outperforms existing state-of-the-art methods.

Dat Le, Khoa Nguyen, Xin Wang, Shu Hu • 2026

Related benchmarks

Task	Dataset	Result
Deepfake Detection	FakeAVCeleb	Video-level AUC 0.8303	9
Temporal Forgery Localization	AV-Deepfake1M++	AP@0.5 52.89	5
Temporal Localization	AV-Deepfake1M++	AUC (Segment) 98.23	5
Video-level Deepfake Detection	AV-Deepfake1M++	AUC (Video) 96.66	5
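
For readers unfamiliar with the localization metric, the following is a minimal sketch of how an AP@0.5 score can be computed, assuming AP@0.5 denotes average precision with a temporal IoU threshold of 0.5, as is conventional for temporal forgery localization. The function names, the (start, end, score) segment format, and the non-interpolated AP formula are illustrative assumptions. This is not the benchmark's official evaluation code, which may differ in detail (for example, interpolation or per-video averaging).

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds, gts, iou_threshold=0.5):
    """Non-interpolated AP for one video (hypothetical helper).

    preds: list of (start, end, score) predicted fake segments.
    gts:   list of (start, end) ground-truth fake segments.
    """
    if not gts:
        return 0.0
    # Rank predictions by confidence, highest first.
    ranked = sorted(preds, key=lambda p: p[2], reverse=True)
    matched = set()  # indices of ground-truth segments already claimed
    hits = []
    for start, end, _score in ranked:
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(gts):
            if idx in matched:
                continue
            iou = temporal_iou((start, end), gt)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        # A prediction is a true positive if it overlaps an unmatched
        # ground-truth segment with IoU at or above the threshold.
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            hits.append(1)
        else:
            hits.append(0)
    # Sum precision at each true-positive rank, normalized by #ground truths.
    true_positives = 0
    precision_sum = 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / len(gts)


if __name__ == "__main__":
    predictions = [(0.0, 2.0, 0.9), (5.0, 6.0, 0.8), (10.0, 12.0, 0.7)]
    ground_truth = [(0.0, 2.0), (10.0, 12.5)]
    print(average_precision(predictions, ground_truth))
```

In this toy input, the first and third predictions match ground-truth segments while the second is a false positive, so the false positive lowers the precision credited to the later match.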
