CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection
About
The proliferation of advanced AI video synthesis techniques poses an unprecedented challenge to digital video authenticity. Existing AI-generated video (AIGV) detection methods primarily focus on uni-modal or spatiotemporal artifacts, but they overlook the rich cues within the visual-textual cross-modal space, especially the temporal stability of semantic alignment. In this work, we identify a distinctive fingerprint in AIGVs, termed cross-modal temporal artifact (CMTA). Unlike real videos, which exhibit natural temporal fluctuations in cross-modal alignment due to semantic variations, AIGVs display unnaturally stable semantic trajectories governed by the given input prompts. To bridge this gap, we propose the CMTA framework, a cross-modal detection approach that captures these unique temporal artifacts through joint cross-modal embedding and multi-grained temporal modeling. Specifically, CMTA leverages BLIP to generate frame-level image captions and utilizes CLIP to extract the corresponding visual-textual representations. A coarse-grained temporal modeling branch is then designed to characterize temporal fluctuations in cross-modal alignment with a GRU. In parallel, a fine-grained branch is constructed to capture intricate inter-frame variations from integrated visual-textual features with a Transformer encoder. Extensive experiments on 40 subsets across four large-scale datasets (GenVideo, EvalCrafter, VideoPhy, and VidProM) validate that our approach sets a new state of the art while exhibiting superior cross-generator generalization. Code and models of CMTA will be released at https://github.com/hwang-cs-ime/CMTA.
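The cue described above, how much frame-caption alignment fluctuates over time, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 2-D vectors stand in for CLIP image and text embeddings, the function names `cosine`, `alignment_trajectory`, and `fluctuation` are hypothetical, and the two summary statistics are a simple stand-in for the learned GRU branch.

```python
import math
from statistics import pstdev


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_trajectory(frame_embs, caption_embs):
    """Per-frame visual-textual alignment score, one value per frame."""
    return [cosine(f, c) for f, c in zip(frame_embs, caption_embs)]


def fluctuation(traj):
    """Summarize how much the alignment score varies over time (assumed statistics)."""
    diffs = [abs(b - a) for a, b in zip(traj, traj[1:])]
    return {
        "std": pstdev(traj),                       # overall spread of alignment
        "mean_abs_diff": sum(diffs) / len(diffs),  # average frame-to-frame change
    }


# Toy embeddings (assumption): in practice these would come from an
# image-text encoder such as CLIP, with captions produced per frame.
captions = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]
stable_frames = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]   # alignment never changes
varying_frames = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]  # alignment drifts over time

stable_stats = fluctuation(alignment_trajectory(stable_frames, captions))
varying_stats = fluctuation(alignment_trajectory(varying_frames, captions))
print(stable_stats, varying_stats)
```

In this toy setup the stable sequence yields zero fluctuation while the drifting one does not, which mirrors the abstract's claim that generated videos show unusually flat alignment trajectories. A learned temporal model would replace these hand-picked statistics.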
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| AI-generated Video Detection | VideoPhy 1.0 (test) | CVX Score: 96.93 | 42 |
| AI-generated Video Detection | EvalCrafter | Floor33 Score: 99.63 | 42 |
| AI-generated Video Detection | VidProM | AUC (MS): 88.15 | 42 |
| AI-generated Video Detection | VideoPhy | CVX AUC: 94.28 | 28 |
| AI-generated Video Detection | EvalCrafter 14 subsets (test) | Floor33 Score: 99.7 | 28 |
| AI-generated Video Detection | GenVideo (test) | Mean Score: 98.74 | 23 |