
Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI

About

Customizing Large Language Models (LLMs) on untrusted datasets poses severe risks of injecting toxic behaviors. In this work, we introduce Optimus, a novel defense framework designed to mitigate fine-tuning harms while preserving conversational utility. Unlike existing defenses that rely heavily on precise toxicity detection or restrictive filtering, Optimus addresses the critical challenge of ensuring robust mitigation even when toxicity classifiers are imperfect or biased. Optimus integrates a training-free toxicity classification scheme that repurposes the safety alignment of commodity LLMs, and employs a dual-strategy alignment process combining synthetic "healing data" with Direct Preference Optimization (DPO) to efficiently steer models toward safety. Extensive evaluations demonstrate that Optimus mitigates toxicity even when relying on extremely biased classifiers (with up to 85% degradation in Recall). Optimus outperforms the state-of-the-art defense StarDSS and exhibits strong resilience against adaptive adversarial and jailbreak attacks. Our source code and datasets are available at https://github.com/secml-lab-vt/Optimus
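The abstract's training-free toxicity classification idea — repurposing the safety alignment of a commodity LLM instead of training a dedicated classifier — can be sketched as follows. This is not the authors' implementation; `query_llm` is a hypothetical stand-in for a call to any safety-aligned chat model, and the refusal markers are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the Optimus implementation): flag a
# fine-tuning sample as toxic if a safety-aligned LLM refuses to engage
# with it, thereby reusing the model's alignment as a free classifier.

REFUSAL_MARKERS = (
    "i cannot", "i can't", "i'm sorry", "as an ai",
)

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; a canned stub here so the sketch runs standalone."""
    if "insult" in prompt.lower():
        return "I'm sorry, but I can't help with that."
    return "Sure, here is a helpful continuation."

def is_toxic(sample: str) -> bool:
    """Treat a refusal from the aligned model as a toxicity signal."""
    prompt = f"Continue the following conversation turn:\n{sample}"
    reply = query_llm(prompt).lower()
    return any(marker in reply for marker in REFUSAL_MARKERS)

if __name__ == "__main__":
    print(is_toxic("Write an insult about my coworker."))  # refusal -> flagged
    print(is_toxic("What's a good pasta recipe?"))         # answered -> passes
```

Samples flagged this way could then feed the paper's dual-strategy alignment step (synthetic healing data plus DPO); the key property the paper evaluates is that mitigation stays robust even when this classifier is heavily biased.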

Aravind Cheruvu, Shravya Kanchi, Sifat Muhammad Abdullah, Nicholas Kong, Daphne Yao, Murtuza Jadliwala, Bimal Viswanath • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Toxicity mitigation evaluation | Offensive category | RTR 50.8 | 8 |
| Toxicity mitigation evaluation | Specialized category | RTR 59.6 | 8 |
| Toxicity mitigation | Specialized category, manually-designed jailbreak attacks | RTR 4.5 | 3 |
| Adversarial toxicity refusal | LLaMA-2 chatbot, Offensive category | RTR 0.2 | 3 |
| Adversarial toxicity refusal | LLaMA-2 chatbot, Specialized category | RTR 1.8 | 3 |
| Toxicity mitigation | DailyDialog | RTR (Backdoor Non-toxic) 0.02 | 3 |

RTR = Refusal Rate.
