GuardReasoner-Omni: A Reasoning-based Multi-modal Guardrail for Text, Image, Video, and Audio
About
We present GuardReasoner-Omni, a reasoning-based guardrail model designed to moderate text, image, video, and audio data. We construct a comprehensive training corpus of 181k samples spanning these four modalities. Our training pipeline follows a two-stage paradigm that incentivizes the model to deliberate before making decisions: (1) supervised fine-tuning (SFT) to cold-start the model with explicit reasoning capabilities and structural adherence; and (2) reinforcement learning (RL) with a concise correctness reward to preserve accurate reasoning while suppressing redundant generation. We release models at 3B and 7B parameters. Extensive experiments demonstrate that GuardReasoner-Omni outperforms existing state-of-the-art baselines across various guardrail benchmarks.
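As a rough illustration of the second stage, a reward that favors correct but concise outputs could be sketched as follows. The abstract does not specify the reward's actual form, so everything here is an assumption made for illustration: the function name, the length budget `max_tokens`, the penalty weight `alpha`, and the use of whitespace splitting as a stand-in for tokenization.

```python
def concise_correctness_reward(predicted_label, gold_label, response_text,
                               max_tokens=512, alpha=0.5):
    """Hypothetical reward: full credit for a correct, short response.

    Not the paper's formula. An incorrect label earns 0. A correct label
    earns 1 minus a penalty that grows with the number of tokens beyond
    max_tokens, capped at alpha.
    """
    if predicted_label != gold_label:
        return 0.0

    # Whitespace splitting is a crude proxy for a real tokenizer (assumption).
    n_tokens = len(response_text.split())
    overflow = max(0, n_tokens - max_tokens)

    # Penalty scales linearly with overflow and saturates at alpha.
    penalty = alpha * min(1.0, overflow / max_tokens)
    return 1.0 - penalty
```

Under these assumptions, a correct answer within the budget scores 1.0, a correct but overly long answer scores between 0.5 and 1.0, and an incorrect answer scores 0.0 regardless of length.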
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Response Harmfulness Detection | HarmBench | F1 Score | 87.61 | 100 |
| Response Harmfulness Detection | XSTEST-RESP | Response Harmfulness F1 | 95.48 | 76 |
| Response Harmfulness Detection | Beavertails | F1 Score | 86.04 | 59 |
| Safety Classification | SafeRLHF | F1 Score | 0.6844 | 48 |
| Response Harmfulness Classification | WildGuard (test) | F1 (Total) | 77.57 | 30 |
| Prompt Harmfulness Detection | Text & Image Benchmarks Average | F1 Score | 83.84 | 19 |
| Prompt Harmfulness Detection | UCF-Crime | F1 Score | 91.67 | 7 |
| Prompt Harmfulness Detection | XD-Violence | F1 Score | 96.82 | 7 |
| Prompt Harmfulness Detection | FVC | F1 Score | 67.86 | 7 |
| Prompt Harmfulness Detection | HarmVideo | F1 Score | 95.5 | 7 |
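
The results above are reported as F1 scores. For reference, here is a minimal sketch of how F1 is computed for a binary harmful/unharmful classifier. The label strings and the function name are illustrative assumptions, not taken from the paper or from any benchmark's evaluation script.

```python
def f1_score(gold, pred, positive="harmful"):
    """F1 for the positive class, returned on a 0-1 scale."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)

    # With no true positives, precision and recall are 0 or undefined.
    if tp == 0:
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


gold = ["harmful", "harmful", "unharmful", "unharmful"]
pred = ["harmful", "unharmful", "harmful", "unharmful"]
score = f1_score(gold, pred)
```

Most rows in the table appear to use a 0 to 100 scale, while the SafeRLHF value appears to be on a 0 to 1 scale; multiplying the function's output by 100 gives the percentage form.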