REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak
About
While Large Language Models (LLMs) demonstrate remarkable capabilities, they remain susceptible to sophisticated, multi-step jailbreak attacks that circumvent conventional surface-level safety alignment by exploiting the internal generation process. To address these vulnerabilities, we propose Reflector, a principled two-stage framework that internalizes self-reflection within the generation trajectory. Reflector first leverages teacher-guided generation to produce high-quality reflection data for supervised fine-tuning (SFT), establishing structured reflection patterns. It subsequently uses Reinforcement Learning (RL) with outcome-driven and reward-validity supervision to instill robust, autonomous self-reflection capabilities. Empirical results show that Reflector achieves Defense Success Rates (DSR) exceeding 90% against complex indirect attacks while generalizing robustly across diverse threat scenarios. Notably, the framework enhances both task-specific and general utility, yielding a 5.85% gain on GSM8K alongside improved performance on knowledge-intensive benchmarks. By internalizing trajectory-level safety, Reflector overcomes the fundamental limitations of surface alignment without significant computational overhead, offering an efficient and scalable solution for the development of safe and capable LLMs.
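The abstract reports results as Defense Success Rate (DSR) but does not define it here. A minimal sketch, assuming DSR is the percentage of attack attempts whose responses are judged safe; the function name `defense_success_rate` and the example verdicts are hypothetical and not taken from the paper:

```python
from typing import Sequence


def defense_success_rate(defended: Sequence[bool]) -> float:
    """Return the percentage of attack attempts that were defended.

    Assumption: each entry is a judge verdict, True when the model's
    response to an attack prompt was judged safe.
    """
    if not defended:
        raise ValueError("need at least one attack attempt")
    return 100.0 * sum(defended) / len(defended)


# Hypothetical verdicts for ten attack prompts (True = attack blocked).
verdicts = [True] * 9 + [False]
print(f"DSR = {defense_success_rate(verdicts):.1f}%")
```

How the verdicts are produced (rule-based matching, a judge model, or human review) is left open in this sketch; the abstract does not say.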
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Massive Multitask Language Understanding | MMLU-Pro | Accuracy | 45.2 | 115 |
| Mathematical Reasoning | GSM8K | Accuracy | 90.15 | 38 |
| Safety Evaluation | WildChat | Safe@1 | 87.8 | 34 |
| Harmfulness Evaluation | PAIR | -- | -- | 22 |
| Safety Evaluation | StrongREJECT | Safety Score | 89.31 | 21 |
| Question Answering | SimpleQA | Accuracy | 6.45 | 20 |
| Safety Evaluation | XS (test) | Safety Score | 100 | 16 |
| Safety Evaluation | Do-Not | Safety Score | 89.46 | 16 |
| Safety Evaluation | AutoDAN | Safety Score | 100 | 16 |
| Safety Evaluation | GCG | Safety Score | 95.96 | 16 |