REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak

About

While Large Language Models (LLMs) demonstrate remarkable capabilities, they remain susceptible to sophisticated, multi-step jailbreak attacks that circumvent conventional surface-level safety alignment by exploiting the internal generation process. To address these vulnerabilities, we propose Reflector, a principled two-stage framework that internalizes self-reflection within the generation trajectory. Reflector first leverages teacher-guided generation to produce high-quality reflection data for supervised fine-tuning (SFT), establishing structured reflection patterns. It subsequently uses Reinforcement Learning (RL) with outcome-driven and reward-validity supervision to instill robust, autonomous self-reflection capabilities. Empirical results show that Reflector achieves Defense Success Rates (DSR) exceeding 90% against complex indirect attacks while generalizing robustly across diverse threat scenarios. Notably, the framework enhances both task-specific and general utility, yielding a 5.85% gain on GSM8K alongside improved performance on knowledge-intensive benchmarks. By internalizing trajectory-level safety, Reflector overcomes the fundamental limitations of surface alignment without significant computational overhead, offering an efficient and scalable solution for the development of safe and capable LLMs.
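As a rough illustration of the headline metric, the sketch below computes a Defense Success Rate from per-prompt safety verdicts. The abstract does not define DSR, so this assumes it is the percentage of attack prompts whose responses are judged safe. The function name `defense_success_rate` and the sample verdicts are hypothetical, not taken from the paper.

```python
from typing import Iterable


def defense_success_rate(judged_safe: Iterable[bool]) -> float:
    """Return the percentage of attack prompts whose response was judged safe.

    Assumption: DSR = 100 * (safe responses) / (attack prompts). The paper's
    exact definition and judging procedure may differ.
    """
    flags = list(judged_safe)
    if not flags:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(flags) / len(flags)


# Hypothetical verdicts for ten attack prompts (True = the defense held).
verdicts = [True, True, False, True, True, True, True, True, True, True]
print(f"DSR = {defense_success_rate(verdicts):.1f}%")
```

In practice the verdicts would come from a safety judge (human or model) applied to each response; that step is outside this sketch.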

Jiachen Ma, Jiawen Zhang, Xiangtian Li, Bo Zou, Chaochao Lu, Chao Yang • 2026

Related benchmarks

Task	Dataset	Metric	Value	
Massive Multitask Language Understanding	MMLU-Pro	Accuracy	45.2	122
Mathematical Reasoning	GSM8K	Accuracy	90.15	80
Safety Evaluation	WildChat	Safe@1	87.8	34
Harmfulness Evaluation	PAIR	--	--	22
Safety Evaluation	StrongREJECT	Safety Score	89.31	21
Question Answering	SimpleQA	Accuracy	6.45	20
Safety Evaluation	XS (test)	Safety Score	100	16
Safety Evaluation	Do-Not	Safety Score	89.46	16
Safety Evaluation	AutoDAN	Safety Score	100	16
Safety Evaluation	GCG	Safety Score	95.96	16

Showing 10 of 18 rows
