Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

Ellipsoid Control: A White-list Jailbreak Defense via Benign Latent Modeling

About

Representation engineering (RepE) defenses have shown strong robustness against jailbreak attacks on large language models (LLMs). However, these methods fundamentally rely on black-list supervision: they learn jailbreak-to-refusal activation transformations from harmful or jailbreak data that are inherently incomplete and continuously evolving. Hence, the performance of RepE-based defenses becomes tightly coupled to the quality and coverage of collected harmful samples, leaving models vulnerable to unseen attacks. This reliance also obscures the distinction between defenses that fit known harmful distributions and defenses that protect a benign latent region without estimating the harmful distribution. We adopt the opposite, the white-list perspective, by leveraging the accessibility and abundance of benign data. The goal is to elicit refusal on arbitrary inputs while ensuring that harmless inputs are not falsely rejected. This shifts the core research question to: How can we design a robust benign-latent preservation mechanism such that the benign latent distribution remains intact while refusal is elicited? To answer this, we propose Ellipsoid Control, a test-time defense. It performs projected gradient descent that can elicit refusal on arbitrary inputs, aiming to improve defense effectiveness. At the same time, an anisotropic benign-geometry ellipsoid is fitted from abundant benign data to constrain the update to minimize distortion of the benign latent geometry. This tight constraint helps preserve model utility. Across multiple LLMs, jailbreak attacks, benign tasks, and safety-boundary evaluations, Ellipsoid Control consistently enhances safety while better preserving utility, demonstrating the effectiveness of the white-list approach for jailbreak defense

Luoyu Chen, Weiqi Wang, Zhiyi Tian, Feng Wu, Ahmed Asiri, Shui Yu• 2026

Related benchmarks

TaskDatasetResultRank
Jailbreak AttackHarmBench--
557
Multi-turn Dialogue EvaluationMT-Bench
Overall Score8.7
532
General Knowledge Question AnsweringMMLU
Accuracy79.4
50
Safety Boundary Over-RefusalORBench
ORBench Over-Refusal Rate14.5
20
Showing 4 of 4 rows

Other info

Follow for update