
YuFeng-XGuard: A Reasoning-Centric, Interpretable, and Flexible Guardrail Model for Large Language Models

About

As large language models (LLMs) are increasingly deployed in real-world applications, safety guardrails are required to go beyond coarse-grained filtering and support fine-grained, interpretable, and adaptable risk assessment. However, existing solutions often rely on rapid classification schemes or post-hoc rules, resulting in limited transparency, inflexible policies, or prohibitive inference costs. To this end, we present YuFeng-XGuard, a reasoning-centric guardrail model family designed to perform multi-dimensional risk perception for LLM interactions. Instead of producing opaque binary judgments, YuFeng-XGuard generates structured risk predictions, including explicit risk categories and configurable confidence scores, accompanied by natural language explanations that expose the underlying reasoning process. This formulation enables safety decisions that are both actionable and interpretable. To balance decision latency and explanatory depth, we adopt a tiered inference paradigm that performs an initial risk decision based on the first decoded token, while preserving on-demand explanatory reasoning when required. In addition, we introduce a dynamic policy mechanism that decouples risk perception from policy enforcement, allowing safety policies to be adjusted without model retraining. Extensive experiments on a diverse set of public safety benchmarks demonstrate that YuFeng-XGuard achieves state-of-the-art performance while maintaining strong efficiency-efficacy trade-offs. We release YuFeng-XGuard as an open model family, including both a full-capacity variant and a lightweight version, to support a wide range of deployment scenarios.
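
The two mechanisms described in the abstract, a risk decision taken from the first decoded token and a policy layer kept separate from risk perception, can be illustrated with a minimal sketch. This is not the authors' implementation or the released model's API. The category names, logit values, and thresholds below are hypothetical, and the softmax over first-token logits is only one plausible way to obtain confidence scores.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


def first_token_risk_scores(logits: Dict[str, float]) -> Dict[str, float]:
    """Turn first-token logits over category labels into confidence scores.

    Assumption: the guardrail's first decoded token is one of a fixed set of
    category labels, so a softmax restricted to those labels gives a score
    per category without decoding the full explanation.
    """
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


@dataclass
class Policy:
    """Enforcement layer: per-category thresholds, editable without retraining."""

    thresholds: Dict[str, float] = field(default_factory=dict)
    default_threshold: float = 0.5

    def decide(self, scores: Dict[str, float]) -> Tuple[str, List[str]]:
        # Flag every non-safe category whose score reaches its threshold.
        flagged = [
            label
            for label, score in scores.items()
            if label != "safe"
            and score >= self.thresholds.get(label, self.default_threshold)
        ]
        return ("block", flagged) if flagged else ("allow", [])


# Hypothetical first-token logits for a single interaction.
scores = first_token_risk_scores({"safe": 2.0, "violence": 1.0, "privacy": 0.0})

# Two policies applied to the same model output: only the thresholds differ.
strict = Policy(thresholds={"violence": 0.2})
lenient = Policy()

print(strict.decide(scores))
print(lenient.decide(scores))
```

In this sketch the same scores yield different decisions under the two policies, which is the sense in which policy can change while the model stays fixed. A fuller explanation would be decoded only when a caller asks for it.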

Junyu Lin, Meizhen Liu, Xiufeng Huang, Jinfeng Li, Haiwen Hong, Xiaohan Yuan, Yuefeng Chen, Longtao Huang, Hui Xue, Ranjie Duan, Zhikai Chen, Yuchuan Fu, Defeng Li, Lingyao Gao, Yitong Yang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Prompt Classification | OverR | F1 Score: 41.9 | 16 |
| Prompt Classification | SEval | F1 Score: 92.5 | 16 |
| Response Safety Classification | Multilingual Safety Benchmarks Response-side (test) | PolyGuard F1: 79.2 | 16 |
| Safe Completion | SEval Prompt 2.0 | F1 Score: 93.7 | 16 |
| Safe Completion | SEval Response 2.0 | F1 Score: 80.8 | 16 |
| Safety Evaluation | Attack Benchmarks Prompt | StrongR: 100 | 16 |
| Safety Evaluation | Attack Benchmarks Response | SEvalA: 82.3 | 16 |
| Prompt Classification | Aegis 2.0 | F1 Score: 87.1 | 16 |
| Prompt Classification | SimpST | F1 Score: 100 | 16 |
| Prompt Classification | XSTest | F1 Score: 94.4 | 16 |
Showing 10 of 19 rows
