GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning

About

To enhance the safety of VLMs, this paper introduces a novel reasoning-based VLM guard model dubbed GuardReasoner-VL. The core idea is to incentivize the guard model to deliberatively reason before making moderation decisions via online RL. First, we construct GuardReasoner-VLTrain, a reasoning corpus with 123K samples and 631K reasoning steps, spanning text, image, and text-image inputs. Then, based on it, we cold-start our model's reasoning ability via SFT. In addition, we further enhance reasoning regarding moderation through online RL. Concretely, to enhance diversity and difficulty of samples, we conduct rejection sampling followed by data augmentation via the proposed safety-aware data concatenation. Besides, we use a dynamic clipping parameter to encourage exploration in early stages and exploitation in later stages. To balance performance and token efficiency, we design a length-aware safety reward that integrates accuracy, format, and token cost. Extensive experiments demonstrate the superiority of our model. Remarkably, it surpasses the runner-up by 19.27% F1 score on average. We release data, code, and models (3B/7B) of GuardReasoner-VL at https://github.com/yueliu1999/GuardReasoner-VL/
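The abstract names a length-aware safety reward (accuracy, format, and token cost) and a dynamic clipping parameter (exploration early, exploitation later) but gives no formulas. The sketch below is one hypothetical way such components could look. It is not the authors' implementation: the function names, weights, token budget, and the linear clipping schedule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    # All values here are illustrative assumptions, not taken from the paper.
    format_weight: float = 0.1   # bonus for a parseable, well-formed output
    length_penalty: float = 0.2  # maximum fraction of the accuracy reward lost to length
    token_budget: int = 512      # token count at which the length penalty saturates


DEFAULT_CONFIG = RewardConfig()


def length_aware_reward(correct: bool, well_formatted: bool, num_tokens: int,
                        cfg: RewardConfig = DEFAULT_CONFIG) -> float:
    """Hypothetical reward combining accuracy, format, and token cost."""
    if not well_formatted:
        return 0.0  # in this sketch, unparseable outputs earn nothing
    accuracy = 1.0 if correct else 0.0
    # Fraction of the token budget consumed, capped at 1.0.
    cost = min(num_tokens / cfg.token_budget, 1.0)
    return accuracy * (1.0 - cfg.length_penalty * cost) + cfg.format_weight


def dynamic_clip(step: int, total_steps: int,
                 eps_start: float = 0.3, eps_end: float = 0.1) -> float:
    """Hypothetical schedule: linearly shrink a PPO-style clipping range over training."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return eps_start + (eps_end - eps_start) * progress
```

Under these assumptions, a correct and concise answer scores higher than a correct but long one, and the clipping range is wide at the start of training (larger policy updates, more exploration) and narrow at the end (smaller updates, more exploitation).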

Yue Liu, Shengfang Zhai, Mingzhe Du, Yulin Chen, Tri Cao, Hongcheng Gao, Cheng Wang, Xinfeng Li, Kun Wang, Junfeng Fang, Jiaheng Zhang, Bryan Hooi • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Response Harmfulness Detection | XSTEST-RESP | Response Harmfulness F1: 92.72 | 34 |
| Safety Classification | SafeRLHF | F1 Score: 0.6663 | 32 |
| Response Harmfulness Classification | WildGuard (test) | F1 (Total): 79.48 | 30 |
| Response Classification | BeaverTails V Text-Image Response | F1 Score: 82.35 | 23 |
| Response Harmfulness Detection | HarmBench | F1 Score: 86.82 | 23 |
| Prompt Harmfulness Detection | Text & Image Benchmarks Average | F1 Score: 79.36 | 19 |
| Response Harmfulness Detection | Beavertails | F1 Score: 85.19 | 18 |
| Response Classification | Wild Guard Text Response | F1 Score: 93.07 | 16 |
| Response Classification | XSTest Text Response | F1 Score: 96.38 | 16 |
| Response Classification | Aegis Text Response 2.0 | F1 Score: 74.64 | 16 |
Showing 10 of 27 rows
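
Every result in the table above is an F1 score. As a reminder of what that number measures, here is a minimal sketch that computes F1 for the harmful class from confusion counts. The function names and the example counts are made up for illustration and are not taken from any benchmark listed here. Note that the table mixes scales: SafeRLHF is reported as a fraction (0.6663), while the other rows appear to be percentages.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 of the positive (harmful) class, returned as a fraction in [0, 1]."""
    denom = 2 * tp + fp + fn
    # Equivalent to the harmonic mean of precision and recall.
    return 2 * tp / denom if denom else 0.0


def as_percent(f1: float) -> str:
    """Format a fractional F1 on the percentage scale used by most rows."""
    return f"{f1 * 100:.2f}"


# Hypothetical counts: 80 harmful items caught, 10 false alarms, 30 misses.
example = f1_score(tp=80, fp=10, fn=30)
print(as_percent(example))  # prints "80.00"
```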
