SaFeR-Steer: Evolving Multi-Turn MLLMs via Synthetic Bootstrapping and Feedback Dynamics

About

MLLMs are increasingly deployed in multi-turn settings, where attackers can escalate unsafe intent through the evolving visual-text history and exploit long-context safety decay. Yet safety alignment is still dominated by single-turn data and fixed-template dialogues, leaving a mismatch between training and deployment. To bridge this gap, we propose SaFeR-Steer, a progressive multi-turn alignment framework that combines staged synthetic bootstrapping with tutor-in-the-loop GRPO to train a single student under adaptive, on-policy attacks. We also introduce Trajectory-Consistent Summative Reward (TCSR), which aggregates the historical minimum and average of turn rewards so that any low-quality turn affects the trajectory-level return.

I. Dataset. We release STEER, a multi-turn multimodal safety dataset with STEER-SFT (12,934), STEER-RL (2,000), and STEER-Bench (3,227) dialogues spanning 2-10 turns.

II. Experiment. Starting from Qwen2.5-VL-3B/7B, SaFeR-Steer substantially improves Safety/Helpfulness on both single-turn (48.30/45.86 → 81.84/70.77 for 3B; 56.21/60.32 → 87.89/77.40 for 7B) and multi-turn benchmarks (12.55/27.13 → 55.58/70.27 for 3B; 24.66/46.48 → 64.89/72.35 for 7B), shifting failures to later turns and yielding robustness beyond scaling alone.

Code is available at https://anonymous.4open.science/r/SaFeR-Steer
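The abstract describes TCSR only in words, so the following is a minimal sketch of one way such a reward could be aggregated, not the authors' formulation. It assumes a convex combination of the running minimum and the running mean of the turn rewards; the function name `tcsr_per_turn` and the mixing weight `alpha` are hypothetical and do not come from the paper.

```python
from typing import List


def tcsr_per_turn(turn_rewards: List[float], alpha: float = 0.5) -> List[float]:
    """Hypothetical TCSR-style aggregation (an assumption, not the paper's formula).

    For each turn t, mixes the minimum and the mean of the rewards seen
    up to and including turn t. `alpha` is an assumed mixing weight.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    summative = []
    running_min = float("inf")
    running_sum = 0.0
    for t, reward in enumerate(turn_rewards, start=1):
        running_min = min(running_min, reward)  # worst turn so far
        running_sum += reward                   # used for the mean so far
        summative.append(alpha * running_min + (1.0 - alpha) * running_sum / t)
    return summative


# One unsafe turn (reward 0.0) in the middle of an otherwise good dialogue.
print(tcsr_per_turn([1.0, 0.0, 1.0]))
```

Under this assumed form, the low-reward second turn keeps depressing the value at every later turn through the running minimum, which matches the stated intent that any low-quality turn affects the trajectory-level return. A plain average would let later good turns progressively wash it out.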

Haolong Hu, Hanyu Li, Tiancheng He, Huahui Yi, An Zhang, Qiankun Li, Kun Wang, Yang Liu, Zhigang Zeng • 2026

Related benchmarks

Task                              | Dataset                  | Metric       | Result | Rank
----------------------------------|--------------------------|--------------|--------|-----
Multi-turn MLLM Safety Evaluation | STEER-MMSafe Multi-turn  | Safety Score | 77.04  | 18
Multi-turn MLLM Safety Evaluation | STEER-VLS Multi-turn     | Safety Score | 65.96  | 18
Multi-turn MLLM Safety Evaluation | STEER-SPA Multi-turn     | Safety       | 66.67  | 18
Multi-turn MLLM Safety Evaluation | STEER-DyS Multi-turn     | Safety Score | 61.36  | 18
Multi-turn MLLM Safety Evaluation | STEER Average Multi-turn | Safety Score | 64.89  | 18
Safety & Helpfulness Evaluation   | Beavertails              | Safety Score | 92.03  | 18
Safety & Helpfulness Evaluation   | MM-Safety                | Safety Score | 88.71  | 18
Safety & Helpfulness Evaluation   | SPA-VL                   | Safety Score | 85.8   | 18
Safety & Helpfulness Evaluation   | VLGuard                  | Safety Score | 88.48  | 18
Safety & Helpfulness Evaluation   | VLSBench                 | Safety Score | 84.41  | 18
Showing 10 of 11 rows
