Low-Resource Safety Failures Are Action Failures, Not Representation Failures

About

Safety alignment learned in high-resource languages transfers poorly to low-resource languages. Models refuse harmful prompts in English but fail to refuse when the same prompts are translated into Swahili or Burmese. Adaptive steering methods like AdaSteer and CAST inherit this failure cross-lingually. We diagnose where transfer breaks down. Across Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B on 23 languages, the harmfulness direction extracted from high-resource activations linearly separates harmful from harmless low-resource prompts nearly as well as high-resource ones. The relevant representation is present. Yet harmful refusal drops from 87.9% to 43.9%. The model fails to convert the representation into refusal. What fails to transfer is calibration of the safety decision, not the underlying representation. We exploit this by recalibrating, rather than retraining, a high-resource gate: a low-rank logistic readout with its decision threshold reset using as few as 1 to 4 target-language examples per class. The gate routes between refusal steering and harmfulness-direction ablation, substantially raising mean refusal selectivity ($\Delta$ = harmful $-$ harmless refusal) from 33.6 for the strongest adapted baseline to 54.5 while preserving MMLU utility. These results suggest that some low-resource safety failures can be repaired by recalibrating existing representations rather than learning new ones. Our code is released: https://github.com/rashadaziz/low-resource-safety.
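The recalibration idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that) and rests on several assumptions: the harmfulness direction is taken as a unit-normalized difference of class means, the paper's low-rank logistic readout is reduced to a single projection onto that direction, and the threshold is reset to the midpoint of the projected class means. All function names are hypothetical.

```python
import numpy as np


def harmfulness_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of class means over (n, dim) activation arrays.

    Assumption: difference-of-means extraction; the paper may use another method.
    """
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)


def midpoint_threshold(direction, harmful_acts, harmless_acts):
    """Threshold halfway between the mean projections of the two classes.

    Assumption: a midpoint rule stands in for the paper's threshold reset.
    It also works when only a few examples per class are available.
    """
    harmful_score = (harmful_acts @ direction).mean()
    harmless_score = (harmless_acts @ direction).mean()
    return 0.5 * (harmful_score + harmless_score)


def gate(direction, threshold, acts):
    """Boolean array: True where a prompt's projection exceeds the threshold."""
    return (acts @ direction) > threshold


def refusal_selectivity(harmful_refusal, harmless_refusal):
    """Delta = harmful refusal rate minus harmless refusal rate."""
    return harmful_refusal - harmless_refusal
```

In this sketch the direction is computed once from high-resource activations and then reused unchanged. Only the scalar threshold is recomputed, by calling `midpoint_threshold` on the handful of target-language examples. This mirrors the abstract's claim that the representation transfers while the decision calibration does not.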

Rashad Aziz, Ikhlasul Akmal Hanif, Fajri Koto • 2026

Related benchmarks

Task | Dataset | Result | Rank
Safety Refusal Evaluation | PolyRefuse 1.0 (sw) | Harmful Refusal Rate: 96.7 | 21
Safety Refusal Evaluation | PolyRefuse 1.0 (km) | Harmful Refusal Rate: 90.9 | 21
Safety Refusal Evaluation | PolyRefuse 1.0 (si) | Harmful Refusal Rate: 90.2 | 21
Safety Refusal Evaluation | PolyRefuse yo 1.0 | Harmful Refusal Rate: 17.7 | 21
Safety Refusal Evaluation | PolyRefuse am 1.0 | Harmful Refusal Rate: 91.6 | 21
Safety Refusal Evaluation | PolyRefuse 1.0 (my) | Harmful Refusal Rate: 85.1 | 21
