GuardReasoner-Omni: A Reasoning-based Multi-modal Guardrail for Text, Image, and Video
About
We present GuardReasoner-Omni, a reasoning-based guardrail model designed to moderate text, image, and video data. We first construct a comprehensive training corpus of 148k samples spanning these three modalities. Our training pipeline then follows a two-stage paradigm that incentivizes the model to deliberate before making decisions: (1) SFT, which cold-starts the model with explicit reasoning capabilities and structural adherence; and (2) RL, which incorporates an error-driven exploration reward to incentivize deeper reasoning on hard samples. We release a suite of models at 2B and 4B parameters. Extensive experiments demonstrate that GuardReasoner-Omni achieves superior performance compared to existing state-of-the-art baselines across various guardrail benchmarks. Notably, GuardReasoner-Omni (2B) surpasses the runner-up by 5.3% in F1 score.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Response Harmfulness Detection | XSTEST-RESP | Response Harmfulness F1 | 95.48 | 34 |
| Safety Classification | SafeRLHF | F1 Score | 0.6844 | 32 |
| Response Harmfulness Classification | WildGuard (test) | F1 (Total) | 77.57 | 30 |
| Response Harmfulness Detection | HarmBench | F1 Score | 87.61 | 23 |
| Prompt Harmfulness Detection | Text & Image Benchmarks Average | F1 Score | 83.84 | 19 |
| Response Harmfulness Detection | Beavertails | F1 Score | 86.04 | 18 |
| Prompt Harmfulness Detection | UCF-Crime | F1 Score | 91.67 | 7 |
| Prompt Harmfulness Detection | XD-Violence | F1 Score | 96.82 | 7 |
| Prompt Harmfulness Detection | FVC | F1 Score | 67.86 | 7 |
| Prompt Harmfulness Detection | HarmVideo | F1 Score | 95.5 | 7 |
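
Every benchmark above reports an F1 score, so a minimal sketch of how such a score is typically computed for binary harmfulness classification may help when reading the numbers. The label strings `"harmful"` and `"unharmful"` and the function name `f1_score` are assumptions chosen for illustration; this is not the evaluation code behind the reported results.

```python
def f1_score(y_true, y_pred, positive="harmful"):
    """Binary F1 with `positive` as the positive class (label name is an assumption)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so F1 is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy labels, not drawn from any benchmark listed above.
    truth = ["harmful", "harmful", "unharmful", "unharmful"]
    preds = ["harmful", "unharmful", "harmful", "unharmful"]
    print(f"F1 = {f1_score(truth, preds):.2f}")
```

The function returns a value between 0 and 1; most rows in the table appear to report the same quantity multiplied by 100.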