OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL

About

Existing forgery detection methods are often limited to uni-modal or bi-modal settings and fail to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. In this unified setting, the interplay between diverse modalities and the dual requirement of simultaneous detection and localization pose a critical "difficulty bias" problem: during multi-task optimization, the simpler veracity classification task tends to dominate the gradients, leading to suboptimal performance in fine-grained grounding. To address this challenge, we propose OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding. Specifically, OmniVL-Guard comprises two core designs: Self-Evolving CoT Generation and Adaptive Reward Scaling Policy Optimization (ARSPO). Self-Evolving CoT Generation synthesizes high-quality reasoning paths, effectively overcoming the cold-start challenge. Building upon this, ARSPO dynamically modulates reward scales and task weights, ensuring balanced joint optimization. Extensive experiments demonstrate that OmniVL-Guard significantly outperforms state-of-the-art methods and exhibits robust zero-shot generalization across out-of-domain scenarios. The dataset and code are publicly available at https://github.com/shen8424/OmniVL-Guard.
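The "difficulty bias" described above can be illustrated with a small, generic sketch of task-balanced reward scaling. This is not the paper's ARSPO algorithm, whose exact formulation is not given in the abstract. It is a minimal illustration that assumes per-task reward groups, standardizes rewards within each task, and gives lower-scoring (harder) tasks a larger weight through a softmax over negated mean rewards. The function names, the `temperature` parameter, and the example reward values are all hypothetical.

```python
import math
from statistics import mean, pstdev


def group_normalize(rewards, eps=1e-6):
    """Standardize the rewards of one task's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def task_weights(mean_rewards, temperature=1.0):
    """Assumed weighting rule: softmax over negated mean rewards,
    so tasks with lower mean reward receive larger weights (sum to 1)."""
    scores = {t: math.exp(-m / temperature) for t, m in mean_rewards.items()}
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()}


def balanced_advantages(rewards_by_task, temperature=1.0):
    """Scale each task's normalized rewards by its task weight."""
    means = {t: mean(r) for t, r in rewards_by_task.items()}
    weights = task_weights(means, temperature)
    return {
        t: [weights[t] * a for a in group_normalize(r)]
        for t, r in rewards_by_task.items()
    }


# Hypothetical rewards: detection is easy (high reward), grounding is hard.
example_rewards = {
    "detection": [1.0, 1.0, 0.0, 1.0],
    "grounding": [0.2, 0.1, 0.0, 0.3],
}
example_advantages = balanced_advantages(example_rewards)
```

In this sketch the grounding task, which has the lower mean reward, contributes a larger share of the training signal than it would under uniform weighting, which is the general idea behind balancing an easy classification task against a harder localization task.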

Jinjie Shen, Jing Wu, Yaxiong Wang, Lechao Cheng, Shengeng Tang, Tianrui Hui, Nan Pu, Zhun Zhong • 2026

Related benchmarks

Task	Dataset	Result
Binary Classification	FSFR	Accuracy 90.85	7
Image Localization	FSFR	Localization Score 0.5426	7
Text Localization	FSFR	Localization Score 63.78	7
Text Localization	Dt In-Domain (test)	F1 Score 63.78	7
Video Localization	FSFR	Localization Score 59.22	7
Binary Classification	MMFakeBench text-image (Out-Of-Domain)	Accuracy 79.38	6
Binary Classification	Dt In-Domain Text-Image (test)	Accuracy 75.52	6
Binary Classification	ISOT text (Out-Of-Domain)	Accuracy 93.69	5
Binary Classification	CASIA2.0 image (Out-Of-Domain)	Accuracy 0.6364	5
Binary Classification	FakeSV text-video (Out-Of-Domain)	Accuracy 63.55	5

Showing 10 of 15 rows
