BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation
About
As generative video models become increasingly realistic, detecting AI-generated videos requires systems that offer both accuracy and interpretability. However, applying Multimodal Large Language Models (MLLMs) to video forensics is currently limited by outdated datasets, simplistic evaluation protocols, and a reliance on black-box classification. To address these issues, we introduce a comprehensive dataset, benchmark, and baseline model for video forgery detection. First, we present **GenBuster-200K**, a fair dataset of over 200,000 high-quality videos produced by state-of-the-art generators and covering diverse real-world scenarios. Second, we propose **GenBuster-Bench**, a diagnostic benchmark spanning three progressive tracks (In-Domain, Out-of-Domain, and In-the-Wild) that evaluates models under *domain shifts* and *generational shifts*; it also introduces an MLLM-as-a-Judge protocol to assess the quality of the generated forensic explanations. Finally, we develop **BusterX**, an MLLM baseline trained with reinforcement learning (RL). Rather than performing direct binary classification, BusterX formulates detection as a visual reasoning task in which the generated reasoning chain itself serves as the detector. Experimental results show that BusterX outperforms several leading MLLMs (e.g., Qwen3.5, Claude-Sonnet-4.6) in both detection accuracy and rationale quality.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Deepfake Detection | FakeAVCeleb (test) | Accuracy | 96.3 | 54 |
| Synthetic Video Detection | GenBuster-Bench Wild 2026 | Fake Detection Rate | 76 | 14 |
| Synthetic Video Detection | GenBuster-Bench OOD 2025 | Detection Rate (Sora) | 81.5 | 14 |
| Synthetic Video Detection | GenBuster-Bench ID 2024 | Real Accuracy | 87 | 14 |
| Video Forgery Detection | GVF | Accuracy (Show1) | 89.32 | 7 |