BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM

About

The rapid advancement of generative AI has substantially improved image and video synthesis, amplifying the risk of multimodal visual misinformation. Recent MLLMs have shown promise for transparent AI-generated content detection through reasoning and explanation, yet existing approaches largely treat image and video forensics as isolated tasks, leaving cross-modal synergies underexplored. To address this, we present BusterX++, a unified MLLM for joint image and video detection with interpretable reasoning. We also introduce GenBuster-Bench++, a meticulously curated, difficulty-aligned benchmark containing balanced image and video samples spanning recent generation models and diverse real-world scenarios. Using this controlled setting, we revisit the widely adopted SFT → RL post-training paradigm. Notably, our findings demonstrate that a single-stage, pure RL strategy driven strictly by sparse outcome rewards consistently matches or surpasses a strong SFT+RL baseline across both unified and single-modality settings. Our key insight reveals that SFT imposes lower policy entropy, which restricts the policy search space and dampens exploratory freedom. In contrast, single-stage pure RL maintains higher policy entropy throughout training, effectively unlocking the spontaneous emergence of cross-modal capability transfer between image and video forensics. Extensive experiments demonstrate that BusterX++ achieves state-of-the-art performance, highlighting the powerful potential of RL for unified cross-modal visual reasoning.
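
The abstract names two quantities, a sparse outcome reward and policy entropy, without defining them. The Python sketch below illustrates what these terms commonly mean; it is not the authors' implementation. It assumes a binary real/fake verdict, a reward of 1 for a correct verdict and 0 otherwise, and entropy measured per next-token distribution in nats. The function names `outcome_reward`, `policy_entropy`, and `mean_policy_entropy` are hypothetical.

```python
import math


def outcome_reward(predicted_label, true_label):
    """Sparse outcome reward (assumed form): 1.0 only if the final verdict is correct."""
    same = predicted_label.strip().lower() == true_label.strip().lower()
    return 1.0 if same else 0.0


def policy_entropy(probs):
    """Shannon entropy in nats of a single next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(step_distributions):
    """Average the per-step entropy over all generation steps of one response."""
    entropies = [policy_entropy(p) for p in step_distributions]
    return sum(entropies) / len(entropies)


# A peaked distribution (little exploration) versus a uniform one (maximal exploration).
peaked = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
print(policy_entropy(peaked) < policy_entropy(uniform))
print(outcome_reward("Fake", "fake"), outcome_reward("real", "fake"))
```

Under these assumptions, a policy whose next-token distributions are sharply peaked has low entropy and samples fewer distinct responses, which is the sense in which the abstract says lower entropy restricts the search space.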

Haiquan Wen, Tianxiao Li, Zhenglin Huang, Yiwei He, Guangliang Cheng • 2025

Related benchmarks

Task	Dataset	Result
Synthetic Video Detection	GenBuster-Bench OOD 2025	Detection Rate (Sora) 96	27
Synthetic Video Detection	GenBuster-Bench Wild 2026	Fake Detection Rate 70.7	27
Synthetic Video Detection	GenBuster-Bench ID 2024	Real Accuracy 87	27
Synthetic Image Detection	FakeClue++	Accuracy 76.2	16
Video Forgery Detection	ID, OOD, and OOD-MintVid Aggregated	Average Score 74.7	16
Video Forgery Detection	OOD (Out-of-Domain) Video	Vidu Q1 70.9	16
Video Forgery Detection	Video Datasets ID (In-Domain) GenBuster++, LOKI	GenBuster++ Score 77.1	16
Video Forgery Detection	MintVid OOD	Fact Score 88.8	16
Reasoning evaluation	DeepfakeJudge Reason 1.0 (test)	BLEU-1 5	16
Deepfake Detection	DeepfakeJudge-Detect (test)	Accuracy (Real) 49.9	15

Showing 10 of 20 rows
