
Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment

About

Alignment is vital for safely deploying large language models (LLMs). Existing techniques are either reward-based (training a reward model on preference pairs and optimizing with reinforcement learning) or reward-free (directly fine-tuning on ranked outputs). Recent research shows that well-tuned reward-based pipelines remain the most robust, and that single-response demonstrations can outperform pairwise preference data. However, two key challenges remain: (1) imbalanced safety datasets that overrepresent common hazards while neglecting long-tail threats; and (2) static reward models that ignore task difficulty, limiting optimization efficiency and attainable gains. To address these limitations, we propose DR-IRL, which Dynamically adjusts Rewards through Inverse Reinforcement Learning. We first train category-specific reward models via IRL, using a balanced safety dataset covering seven harmful categories as demonstrations. We then enhance Group Relative Policy Optimization (GRPO) with dynamic reward scaling, which adjusts rewards by task difficulty: data-level hardness, measured by text-encoder cosine similarity, and model-level responsiveness, measured by reward gaps. Extensive experiments across various benchmarks and LLMs demonstrate that DR-IRL outperforms all baseline methods in safety alignment while maintaining usefulness.
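The abstract does not give the scaling formula, so the following is only a minimal sketch of how difficulty-aware scaling might be layered on GRPO-style group-relative advantages. Everything specific here is an assumption made for illustration: the function names, the definition of hardness as one minus cosine similarity, the treatment of the reward gap as a non-negative number whose larger values mean a harder task, the additive form of `difficulty_scale`, and the weights `alpha` and `beta`. Only the group normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO advantage.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def data_hardness(sample_emb, reference_emb):
    # ASSUMPTION: a sample whose text-encoder embedding is less similar to a
    # reference embedding is treated as harder. The paper may define this differently.
    return 1.0 - cosine_similarity(sample_emb, reference_emb)


def difficulty_scale(hardness, reward_gap, alpha=1.0, beta=1.0):
    # HYPOTHETICAL combination rule: additive, with illustrative weights.
    # reward_gap stands in for "model-level responsiveness"; its direction
    # (larger gap = harder) is an assumption, not taken from the source.
    return 1.0 + alpha * hardness + beta * reward_gap


def scaled_group_advantages(rewards, scale, eps=1e-8):
    """Group-relative advantages for one prompt, multiplied by a difficulty scale.

    The scale is applied after normalization: multiplying every reward in the
    group by the same factor before normalizing would cancel out.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [scale * (r - mu) / (sigma + eps) for r in rewards]
```

In this sketch a harder prompt yields larger-magnitude advantages for the same spread of rewards, so its samples contribute more to the policy update; an easy prompt (hardness and gap near zero) reduces to plain group-normalized advantages.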

Ruoxi Cheng, Haoxuan Ma, Weixin Wang, Ranjie Duan, Jiexi Liu, Xiaoshuang Jia, Simeng Qin, Xiaochun Cao, Yang Liu, Xiaojun Jia • 2025

Related benchmarks

Task                  Dataset       Result              Rank
Safety Evaluation     StrongREJECT  --                  65
Harmlessness          Stereotype    Refusal Rate 99.03  20
Helpfulness           SimpleQA      Accuracy 6.64       20
Helpfulness           AdvGLUE       Accuracy 75.15      20
Helpfulness           GSM8K         Accuracy 89.7       20
Helpfulness           HHH           Accuracy 90.71      20
Harmlessness          XSTest        Refusal Rate 99     20
Safety                WildChat      Refusal Rate 74.21  20
Jailbreak Resistance  AutoDAN       Refusal Rate 59     3
Jailbreak Resistance  GCG           Refusal Rate 96.98  3
Showing 10 of 12 rows
