# Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment

## About
Alignment is vital for safely deploying large language models (LLMs). Existing techniques are either reward-based (training a reward model on preference pairs and optimizing with reinforcement learning) or reward-free (directly fine-tuning on ranked outputs). Recent research shows that well-tuned reward-based pipelines remain the most robust, and that single-response demonstrations can outperform pairwise preference data. However, two key challenges remain: (1) imbalanced safety datasets that overrepresent common hazards while neglecting long-tail threats; and (2) static reward models that ignore task difficulty, limiting optimization efficiency and attainable gains. To address these limitations, we propose DR-IRL, which Dynamically adjusts Rewards through Inverse Reinforcement Learning. We first train category-specific reward models via IRL, using a balanced safety dataset covering seven harmful categories as demonstrations. We then enhance Group Relative Policy Optimization (GRPO) with dynamic reward scaling: rewards are adjusted according to task difficulty, measured at the data level by hardness (text-encoder cosine similarity) and at the model level by responsiveness (reward gaps). Extensive experiments across various benchmarks and LLMs demonstrate that DR-IRL outperforms all baseline methods in safety alignment while maintaining usefulness.
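The dynamic reward scaling step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact combination of the two difficulty signals is an assumption here, with `hardness` standing in for a data-level score derived from text-encoder cosine similarity and `reward_gap` for the model-level responsiveness signal; the weighting coefficients `alpha` and `beta` are hypothetical.

```python
import statistics

def group_advantages(rewards):
    """GRPO-style group baseline: normalize each sampled response's
    reward against the mean and std of its group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

def dynamic_scale(advantages, hardness, reward_gap, alpha=0.5, beta=0.5):
    """Hypothetical difficulty weight: harder data (low cosine similarity
    to familiar examples -> hardness near 1) and a less responsive model
    (small reward gap within the group) both increase the weight,
    amplifying the policy update on difficult tasks."""
    weight = 1.0 + alpha * hardness + beta * (1.0 - reward_gap)
    return [a * weight for a in advantages]

# Example: a group of three sampled responses for one prompt.
adv = group_advantages([1.0, 2.0, 3.0])
scaled = dynamic_scale(adv, hardness=0.4, reward_gap=0.2)
```

Scaling is applied after group normalization; a purely multiplicative factor on the raw rewards would cancel out when GRPO normalizes within the group.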
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Safety Evaluation | StrongREJECT | -- | -- | 65 |
| Harmlessness | Stereotype | Refusal Rate | 99.03 | 20 |
| Helpfulness | SimpleQA | Accuracy | 6.64 | 20 |
| Helpfulness | AdvGLUE | Accuracy | 75.15 | 20 |
| Helpfulness | GSM8K | Accuracy | 89.7 | 20 |
| Helpfulness | HHH | Accuracy | 90.71 | 20 |
| Harmlessness | XSTest | Refusal Rate | 99 | 20 |
| Safety | WildChat | Refusal Rate | 74.21 | 20 |
| Jailbreak Resistance | AutoDAN | Refusal Rate | 59 | 3 |
| Jailbreak Resistance | GCG | Refusal Rate | 96.98 | 3 |