Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control

About

Safety alignment of large language models remains brittle under domain shift and noisy preference supervision. Most existing robust alignment methods focus on uncertainty in alignment data, while overlooking optimization-induced fragility in preference-based objectives. In this work, we revisit robustness for LLM safety alignment from an optimization geometry perspective, and argue that robustness failures cannot be addressed by data-centric methods alone. We propose ShaPO, a geometry-aware preference optimization framework that enforces worst-case alignment objectives via selective geometry control over an alignment-critical parameter subspace. By avoiding uniform geometry constraints, ShaPO mitigates the over-regularization that can harm robustness under distribution shift. We instantiate ShaPO at two levels: token-level ShaPO stabilizes likelihood-based surrogate optimization, while reward-level ShaPO enforces reward-consistent optimization under noisy supervision. Across diverse safety benchmarks and noisy preference settings, ShaPO consistently improves safety robustness over popular preference optimization methods. Moreover, ShaPO composes cleanly with data-robust objectives, yielding additional gains and empirically supporting the proposed optimization-geometry perspective. The code is available at https://github.com/liujilong0116/ShaPO.

Yonghui Yang, Wenjian Tao, Jilong Liu, Xingyu Zhu, Junfeng Fang, Weibiao Huang, Le Wu, Richang Hong, Tat-Seng Chua • 2026
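
The abstract describes enforcing a worst-case objective through geometry control restricted to an alignment-critical parameter subspace, but it does not spell out the procedure. As a rough illustration only, the sketch below applies a sharpness-aware (SAM-style) ascent step to a selected subset of coordinates of a toy DPO-style logistic preference loss. Everything here is an assumption made for illustration and is not taken from the paper or its repository: the linear scoring model, the top-k gradient-magnitude selection rule, and the names `select_critical`, `selective_sam_step`, `rho`, `lr`, and `k` are all hypothetical.

```python
import math


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def _margin(theta, x_chosen, x_rejected, beta):
    # Toy linear "reward": score(x) = theta . x, so the margin is theta . (chosen - rejected).
    return beta * sum(t * (c - r) for t, c, r in zip(theta, x_chosen, x_rejected))


def preference_loss(theta, x_chosen, x_rejected, beta=1.0):
    # DPO-style logistic loss: -log sigmoid(margin), written as a stable softplus(-margin).
    m = _margin(theta, x_chosen, x_rejected, beta)
    return max(-m, 0.0) + math.log1p(math.exp(-abs(m)))


def preference_grad(theta, x_chosen, x_rejected, beta=1.0):
    # Gradient of softplus(-margin) with respect to theta.
    m = _margin(theta, x_chosen, x_rejected, beta)
    coeff = -beta * _sigmoid(-m)
    return [coeff * (c - r) for c, r in zip(x_chosen, x_rejected)]


def select_critical(grad, k):
    # ASSUMPTION: treat the k coordinates with the largest gradient magnitude
    # as the "alignment-critical" subspace. The paper may use another criterion.
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return sorted(order[:k])


def selective_sam_step(theta, x_chosen, x_rejected, rho=0.05, lr=0.1, k=2, beta=1.0):
    # 1) Gradient at the current point.
    g = preference_grad(theta, x_chosen, x_rejected, beta)
    # 2) Worst-case (ascent) perturbation restricted to the selected coordinates.
    idx = select_critical(g, k)
    norm = math.sqrt(sum(g[i] ** 2 for i in idx))
    perturbed = list(theta)
    if norm > 0.0:
        for i in idx:
            perturbed[i] += rho * g[i] / norm
    # 3) Descend using the gradient evaluated at the perturbed point.
    g_adv = preference_grad(perturbed, x_chosen, x_rejected, beta)
    return [t - lr * ga for t, ga in zip(theta, g_adv)]


if __name__ == "__main__":
    theta = [0.0, 0.0, 0.0, 0.0]
    chosen = [1.0, 0.5, 0.0, -0.2]
    rejected = [0.0, 0.5, 1.0, 0.1]
    before = preference_loss(theta, chosen, rejected)
    theta = selective_sam_step(theta, chosen, rejected)
    after = preference_loss(theta, chosen, rejected)
    print(f"loss before: {before:.4f}, after: {after:.4f}")
```

Restricting the perturbation to a subset of coordinates is one way to read "avoiding uniform geometry constraints": the remaining parameters receive an ordinary gradient update rather than a sharpness penalty. The released code at the URL above is the authoritative reference for the actual method.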

Related benchmarks

Task	Dataset	Result
Safety Alignment	HH-RLHF	MD Rate 1.09	68
Safety Alignment	Salad Bench	MD 0.68	68
Safety Alignment	Do-Not-Answer	MD 0.00	52
Safety Alignment	PKU-SafeRLHF 30K (IID)	WR 89.26	36
Safety Alignment	Average (Do-Not-Answer, HarmBench, HH-RLHF, Salad Bench)	Aggregate Score 0.59	18
Safety Alignment	HarmBench	MD Score 0.5	18

