
Test-Time Detoxification without Training or Learning Anything

About

Large language models can produce toxic or inappropriate text even for benign inputs, creating risks when deployed at scale. Detoxification is therefore important for safety and user trust, particularly when we want to reduce harmful content without sacrificing the model's generation quality. Many existing approaches rely on model retraining, gradients, or learned auxiliary components, which can be costly and may not transfer across model families or to truly black-box settings. We introduce a test-time procedure that approximates the gradient of completion toxicity with respect to the input embeddings and uses a small number of descent steps to steer generation toward less toxic continuations. This is achieved with zeroth-order optimization that requires only access to input embeddings, a toxicity scoring function, and forward evaluations of the model. Empirically, the approach delivers robust toxicity reductions across models and prompts and, in most settings, achieves the best overall toxicity-quality trade-off. More broadly, our work positions word embeddings as effective control variables and encourages wider use of black-box optimization to guide autoregressive language models toward scalable, safer text generation, without requiring any training or access to intermediate computations.

Baturay Saglam, Dionysis Kalogerias • 2026
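
The abstract describes estimating the gradient of completion toxicity with respect to the input embeddings using only forward evaluations, then taking a few descent steps. A minimal sketch of that general idea follows, assuming a standard two-point randomized finite-difference estimator; the paper's exact estimator and hyperparameters are not given here. The names `zo_gradient`, `zo_descent`, and `toy_toxicity`, and the parameters `mu`, `num_samples`, `steps`, and `lr`, are hypothetical. `toy_toxicity` is a stand-in for the real black-box pipeline (generate a completion from the embeddings, then score it with a toxicity classifier).

```python
import numpy as np


def zo_gradient(f, x, mu=1e-2, num_samples=16, rng=None):
    """Estimate the gradient of a black-box scalar function f at x.

    Uses a two-point (central) finite difference along random Gaussian
    directions, so only forward evaluations of f are needed.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(x)
    for _ in range(num_samples):
        u = rng.standard_normal(x.shape)  # random probe direction
        # Directional derivative estimate along u, scaled back onto u.
        grad += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return grad / num_samples


def zo_descent(f, x0, steps=5, lr=0.1, **kwargs):
    """Run a small number of descent steps using the zeroth-order estimate."""
    x = x0.copy()
    for _ in range(steps):
        x = x - lr * zo_gradient(f, x, **kwargs)
    return x


def toy_toxicity(emb):
    """Hypothetical stand-in for 'generate a completion, then score toxicity'.

    A real pipeline would call the language model and a toxicity scorer here.
    """
    return float(np.sum(emb ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb0 = np.ones((2, 4))  # pretend: 2 tokens, 4-dimensional embeddings
    emb1 = zo_descent(toy_toxicity, emb0, steps=20, lr=0.05,
                      num_samples=16, rng=rng)
    print("score before:", toy_toxicity(emb0))
    print("score after: ", toy_toxicity(emb1))
```

In this sketch, each descent step costs `2 * num_samples` forward evaluations of the scoring function, which is the main cost to weigh when choosing the number of probe directions.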

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Toxicity Mitigation | RealToxicityPrompts challenging | Avg Toxicity (Max): 14.4 | 46 |
| Detoxification | RealToxicityPrompts challenging | Max Toxicity: 0.151 | 32 |
| Detoxification | AttaQ benchmark | Avg Toxicity (Max): 0.108 | 32 |
| Detoxification | BoLD | Toxicity (Max): 3.2 | 28 |
| Toxicity Evaluation | BOLD 23679 prompts (test) | Avg Toxicity (Max): 0.038 | 18 |
| Toxicity Evaluation | BoLD | Avg Toxicity (Max): 0.025 | 14 |
| Toxicity Evaluation | AttaQ | Max Toxicity Score: 0.076 | 14 |
| Toxicity Evaluation | AttaQ 1402 prompts (test) | Max Toxicity Score: 0.086 | 14 |
