GuardReasoner-Omni: A Reasoning-based Multi-modal Guardrail for Text, Image, and Video

About

We present GuardReasoner-Omni, a reasoning-based guardrail model designed to moderate text, image, and video data. First, we construct a comprehensive training corpus comprising 148k samples spanning these three modalities. Our training pipeline follows a two-stage paradigm to incentivize the model to deliberate before making decisions: (1) conducting SFT to cold-start the model with explicit reasoning capabilities and structural adherence; and (2) performing RL, incorporating an error-driven exploration reward to incentivize deeper reasoning on hard samples. We release a suite of models scaled at 2B and 4B parameters. Extensive experiments demonstrate that GuardReasoner-Omni achieves superior performance compared to existing state-of-the-art baselines across various guardrail benchmarks. Notably, GuardReasoner-Omni (2B) significantly surpasses the runner-up by 5.3% F1 score.

Zhenhao Zhu, Yue Liu, Yanpei Guo, Wenjie Qu, Cancan Chen, Yufei He, Yibo Li, Yulin Chen, Tianyi Wu, Huiying Xu, Xinzhong Zhu, Jiaheng Zhang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Response Harmfulness Detection | XSTEST-RESP | Response Harmfulness F1 | 95.48 | 34 |
| Safety Classification | SafeRLHF | F1 Score | 0.6844 | 32 |
| Response Harmfulness Classification | WildGuard (test) | F1 (Total) | 77.57 | 30 |
| Response Harmfulness Detection | HarmBench | F1 Score | 87.61 | 23 |
| Prompt Harmfulness Detection | Text & Image Benchmarks Average | F1 Score | 83.84 | 19 |
| Response Harmfulness Detection | Beavertails | F1 Score | 86.04 | 18 |
| Prompt Harmfulness Detection | UCF-Crime | F1 Score | 91.67 | 7 |
| Prompt Harmfulness Detection | XD-Violence | F1 Score | 96.82 | 7 |
| Prompt Harmfulness Detection | FVC | F1 Score | 67.86 | 7 |
| Prompt Harmfulness Detection | HarmVideo | F1 Score | 95.5 | 7 |
Showing 10 of 14 rows
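
Every result in the benchmark list above is an F1 score. As a reading aid, the following is a minimal sketch of how F1 is computed for a binary harmfulness classifier. It is not the authors' evaluation code: the label strings "harmful" and "unharmful", the function name, and the toy data are assumptions made for illustration.

```python
def f1_score(y_true, y_pred, positive="harmful"):
    """Binary F1 with `positive` as the target class (label names are assumed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero or undefined, so report 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy ground truth and predictions, invented for this example only.
labels = ["harmful", "harmful", "unharmful", "unharmful", "harmful"]
preds = ["harmful", "unharmful", "unharmful", "harmful", "harmful"]

score = f1_score(labels, preds)
print(f"F1 = {100 * score:.2f}")  # prints "F1 = 66.67"
```

Here the classifier has 2 true positives, 1 false positive, and 1 false negative, so precision and recall are both 2/3 and F1 is 2/3. Most rows in the list appear to report F1 on a 0 to 100 scale, while the SafeRLHF value (0.6844) appears to be on a 0 to 1 scale.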
