BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM
About
Recent advances in generative AI have dramatically improved image and video synthesis capabilities, significantly increasing the risk of misinformation through sophisticated fake content. In response, detection methods have evolved from traditional approaches to multimodal large language models (MLLMs), offering enhanced transparency and interpretability in identifying synthetic media. However, current detection systems remain fundamentally limited by their single-modality design. These approaches analyze images or videos separately, making them ineffective against synthetic content that combines multiple media formats. To address this limitation, we introduce BusterX++, a framework for unified detection and explanation of synthetic images and videos, trained with a direct reinforcement learning (RL) post-training strategy. To enable comprehensive evaluation, we also present GenBuster++, a unified benchmark leveraging state-of-the-art image and video generation techniques. This benchmark comprises 4,000 images and video clips, curated by human experts to ensure high quality, diversity, and real-world applicability. Extensive experiments demonstrate the effectiveness and generalizability of our approach.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Forgery Detection | ID, OOD, and OOD-MintVid Aggregated | Average Score | 74.7 | 16 |
| Video Forgery Detection | OOD (Out-of-Domain) Video | Vidu Q1 | 70.9 | 16 |
| Video Forgery Detection | Video Datasets ID (In-Domain) GenBuster++, LOKI | GenBuster++ Score | 77.1 | 16 |
| Video Forgery Detection | MintVid OOD | Fact Score | 88.8 | 16 |
| Reasoning evaluation | DeepfakeJudge Reason 1.0 (test) | BLEU-1 | 5 | 16 |
| Deepfake Detection | DeepfakeJudge-Detect (test) | Accuracy (Real) | 49.9 | 15 |
| AI-generated Video Detection | ViF-Bench T2V 1.0 (test) | Accuracy (Acc) | 56.9 | 13 |