Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning

About

As large language models (LLMs) continue to grow in capability, so do the risks of harmful misuse through fine-tuning. While most prior studies assume that attackers rely on supervised fine-tuning (SFT) for such misuse, we systematically demonstrate that reinforcement learning (RL) enables adversaries to more effectively break safety alignment and facilitate more advanced harmful task assistance, under matched computational budgets. To counter this emerging threat, we propose TokenBuncher, the first effective defense specifically targeting RL-based harmful fine-tuning. TokenBuncher suppresses the foundation on which RL relies: model response entropy. By constraining entropy, RL-based fine-tuning can no longer exploit distinct reward signals to drive the model toward harmful behaviors. We realize this defense through entropy-as-reward RL and a Token Noiser mechanism designed to prevent the escalation of harmful capabilities. Extensive experiments across multiple models and RL algorithms show that TokenBuncher robustly mitigates harmful RL fine-tuning while preserving benign task performance and finetunability. Our results highlight that RL-based harmful fine-tuning poses a greater systemic risk than SFT, and that TokenBuncher provides an effective and general defense.

Weitao Feng, Lixu Wang, Peizhuo Lv, Tianyi Wei, Jie Zhang, Chongyang Gao, Sinong Zhan, Wei Dong • 2025
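
The abstract's core idea, constraining response entropy so that RL has little variation left to exploit, can be illustrated with a toy calculation. The sketch below is only an assumption-laden illustration, not the authors' implementation: the function names (`token_entropy`, `entropy_reward`), the use of mean per-token Shannon entropy, and the choice of negative entropy as the reward are all hypothetical. The Token Noiser mechanism is not modeled here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of one next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_reward(per_token_logits):
    """Hypothetical reward: negative mean token entropy of a response.

    Assumption: lower entropy earns a higher reward, so an RL step
    that maximizes this quantity pushes the model toward
    low-entropy (near-deterministic) responses.
    """
    entropies = [token_entropy(step) for step in per_token_logits]
    return -sum(entropies) / len(entropies)

# Toy vocabulary of 4 tokens, 2 generation steps per response.
peaked = [[10.0, 0.0, 0.0, 0.0], [0.0, 12.0, 0.0, 0.0]]
flat = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]

# The peaked (low-entropy) response receives the higher reward.
print(entropy_reward(peaked) > entropy_reward(flat))  # True
```

In this toy setting the uniform distribution has entropy ln 4 per step, while the peaked distributions are close to zero. That gap is what an entropy-based reward would act on.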

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy (Acc) | 87.6 | 337 |
| Mathematical Problem Solving | MATH | Accuracy | 72.8 | 75 |
| Safety Alignment Breaking Prevention | HarmBench | Harmful Score (%) | 0.00 | 60 |
| Safety Alignment Breaking Prevention | StrongREJECT | Harmful Score (%) | 0.00 | 60 |
| Harmful Knowledge Evaluation | WMDP-evil | WMDP-evil Score | 11.52 | 60 |