
SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety

About

Vision-language models remain susceptible to multimodal jailbreaks and over-refusal because safety depends on both visual evidence and user intent, whereas many alignment pipelines supervise only the final response. To address this, we present SaFeR-ToolKit, which formalizes safety decision-making as a checkable protocol. Concretely, a planner specifies a persona, a Perception $\to$ Reasoning $\to$ Decision tool set, and a constrained transition graph, while a responder outputs a typed key-value tool trace before the final answer. To ensure the protocol is followed reliably in practice, we train a single policy with a three-stage curriculum (SFT $\to$ DPO $\to$ GRPO), in which GRPO directly supervises tool usage beyond answer-level feedback. Our contributions are two-fold. I. Dataset: the first tool-based safety reasoning dataset, comprising 31,654 examples (SFT 6k, DPO 18.6k, GRPO 6k) plus 1k held-out evaluation. II. Experiments: on Qwen2.5-VL, SaFeR-ToolKit significantly improves Safety/Helpfulness/Reasoning Rigor for the 3B model (29.39/45.04/4.98 $\to$ 84.40/71.13/78.87) and the 7B model (53.21/52.92/19.26 $\to$ 86.34/80.79/85.34), while preserving general capabilities (3B: 58.67 $\to$ 59.21; 7B: 66.39 $\to$ 66.81). Code is available at https://github.com/Duebassx/SaFeR_ToolKit.
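To make the idea of a "checkable protocol" concrete, the following minimal sketch validates a typed key-value tool trace against a constrained transition graph. It is an illustration only, not the authors' implementation: the tool names (`describe_image`, `infer_intent`, `assess_risk`, `decide`), the transition graph, the required keys, and the `check_trace` helper are all hypothetical. The abstract specifies only that tools are organized as Perception $\to$ Reasoning $\to$ Decision and that transitions are constrained.

```python
# Hypothetical transition graph: which tool may follow which.
# "START" is a sentinel for the beginning of a trace.
ALLOWED_NEXT = {
    "START": {"describe_image"},                 # Perception must come first
    "describe_image": {"infer_intent"},          # Perception -> Reasoning
    "infer_intent": {"assess_risk", "decide"},   # Reasoning -> Reasoning/Decision
    "assess_risk": {"decide"},                   # Reasoning -> Decision
    "decide": set(),                             # Decision is terminal
}

# Hypothetical typed schema: required keys and their types per tool.
REQUIRED_KEYS = {
    "describe_image": {"objects": list},
    "infer_intent": {"intent": str},
    "assess_risk": {"risk": str},
    "decide": {"action": str},
}


def check_trace(trace):
    """Return a list of protocol violations; an empty list means the trace is valid."""
    errors = []
    prev = "START"
    for i, call in enumerate(trace):
        tool = call["tool"]
        if tool not in REQUIRED_KEYS:
            errors.append(f"step {i}: unknown tool {tool!r}")
            return errors
        if tool not in ALLOWED_NEXT[prev]:
            errors.append(f"step {i}: transition {prev} -> {tool} not allowed")
        for key, expected_type in REQUIRED_KEYS[tool].items():
            if not isinstance(call["args"].get(key), expected_type):
                errors.append(f"step {i}: {tool}.{key} missing or not {expected_type.__name__}")
        prev = tool
    if prev != "decide":
        errors.append("trace does not end with a decision tool")
    return errors


good_trace = [
    {"tool": "describe_image", "args": {"objects": ["kitchen knife"]}},
    {"tool": "infer_intent", "args": {"intent": "cooking question"}},
    {"tool": "decide", "args": {"action": "answer"}},
]
bad_trace = [{"tool": "decide", "args": {"action": "refuse"}}]  # skips Perception

good_errors = check_trace(good_trace)
bad_errors = check_trace(bad_trace)
```

Under these assumptions, `good_errors` is empty and `bad_errors` reports the disallowed `START -> decide` transition. A rule-based check of this kind is one plausible way tool usage could be scored separately from the final answer, though the abstract does not state how the GRPO reward is computed.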

Zixuan Xu, Tiancheng He, Huahui Yi, Kun Wang, Xi Chen, Gongli Xi, Qiankun Li, Kang Li, Yang Liu, Zhigang Zeng • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | Accuracy 90.13 | 1455 |
| Multimodal Capability Evaluation | MM-Vet | -- | 345 |
| Multi-discipline Multimodal Understanding | MMMU | Accuracy 51.33 | 317 |
| Multimodal Capability Evaluation | MM-Star | Average Score 60.4 | 36 |
| Multimodal Safety Evaluation | SPA-VL | Safety Score 91.89 | 26 |
| Multimodal Safety Evaluation | MM-SafetyBench | Safety Score 2.73 | 22 |
| Multimodal Safety Evaluation | ToolkitBench | Safety Score 2.49 | 22 |
| Multimodal Safety Evaluation | BeaverTails V | Safety Score 2.85 | 22 |
| Multimodal Safety Evaluation | MSSBench | Safety Score 2.39 | 22 |
| Visual Mathematical Reasoning | MathVista 1.0 (testmini) | Accuracy 66.5 | 18 |
Showing 10 of 14 rows
