Towards multi-modal forgery representation learning for AI-generated video detection and localization
About
Recent advances in generative AI have democratized video creation at scale. AI-generated videos, including clips partially manipulated across the visual and audio channels, pose escalating risks of semantic distortion and misuse, which motivates reliable detection tools. Most existing AI-generated video detectors are limited in two ways: they model only a single modality or a subset of modalities, and they lack fine-grained temporal forgery localization. To address these limitations, we introduce a core architecture that jointly integrates an LMM semantic branch with a spatio-temporal (ST) visual branch and a multi-scale partial-spoof (PS) audio branch. This multi-modal approach enables simultaneous detection and fine-grained temporal localization of partially manipulated AI-generated video forgeries. Extensive experiments show that this approach outperforms existing state-of-the-art methods.
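To make the idea of joint detection and temporal localization concrete, here is a minimal sketch. It is not the authors' method. It assumes each of the three branches (visual, audio, semantic) has already produced one fake probability per frame, and it uses plain late fusion by weighted averaging followed by thresholding. The function names, the weights, the threshold, and the frame rate are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def fuse_scores(
    visual: Sequence[float],
    audio: Sequence[float],
    semantic: Sequence[float],
    weights: Tuple[float, float, float] = (0.4, 0.4, 0.2),  # assumed, not from the paper
) -> List[float]:
    """Weighted average of per-frame fake probabilities from three branches."""
    if not (len(visual) == len(audio) == len(semantic)):
        raise ValueError("all branches must score the same number of frames")
    w_v, w_a, w_s = weights
    total = w_v + w_a + w_s
    return [
        (w_v * v + w_a * a + w_s * s) / total
        for v, a, s in zip(visual, audio, semantic)
    ]


def localize_segments(
    scores: Sequence[float],
    threshold: float = 0.5,  # assumed decision threshold
    fps: float = 25.0,       # assumed frame rate
) -> List[Tuple[float, float]]:
    """Group consecutive above-threshold frames into (start_sec, end_sec) segments."""
    segments: List[Tuple[float, float]] = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i  # a forged run begins
        elif score < threshold and start is not None:
            segments.append((start / fps, i / fps))  # the run ended at frame i
            start = None
    if start is not None:  # the run extends to the last frame
        segments.append((start / fps, len(scores) / fps))
    return segments


def video_score(scores: Sequence[float]) -> float:
    """Video-level score: the most suspicious frame decides (one simple choice)."""
    return max(scores) if scores else 0.0
```

With this sketch, a clip whose middle frames score high in both the visual and audio branches yields a single segment covering those frames, and the same fused scores give the video-level decision. A learned fusion module would replace the fixed weights in any real system.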
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Deepfake Detection | FakeAVCeleb | Video-level AUC | 0.8303 | 9 |
| Temporal Forgery Localization | AV-Deepfake1M++ | AP@0.5 | 52.89 | 5 |
| Temporal Localization | AV-Deepfake1M++ | AUC (Segment) | 98.23 | 5 |
| Video-level Deepfake Detection | AV-Deepfake1M++ | AUC (Video) | 96.66 | 5 |
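Temporal localization metrics of the AP@0.5 kind are conventionally built on temporal intersection-over-union (IoU): a predicted segment counts as a match when its IoU with a ground-truth segment reaches the threshold, here 0.5. The sketch below shows only that matching criterion, assuming segments are `(start, end)` pairs in seconds. The function names are hypothetical, and the benchmark's exact evaluation protocol may differ.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Overlap duration divided by the combined duration of two segments."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def is_match(pred: Segment, truth: Segment, iou_threshold: float = 0.5) -> bool:
    """True when the prediction overlaps the ground truth enough to count as a hit."""
    return temporal_iou(pred, truth) >= iou_threshold
```

For example, a prediction of 2.0 to 4.0 s against a ground truth of 3.0 to 5.0 s overlaps for 1 s out of a 3 s union, an IoU of one third, so it would not count as a match at the 0.5 threshold.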