# Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment

## About
Alignment is vital for safely deploying large language models (LLMs). Existing techniques are either reward-based (training a reward model on preference pairs and optimizing with reinforcement learning) or reward-free (directly fine-tuning on ranked outputs). Recent research shows that well-tuned reward-based pipelines remain the most robust, and that single-response demonstrations can outperform pairwise preference data. However, two key challenges remain: (1) imbalanced safety datasets that overrepresent common hazards while neglecting long-tail threats; and (2) static reward models that ignore task difficulty, limiting optimization efficiency and attainable gains. To address these limitations, we propose DR-IRL, which Dynamically adjusts Rewards through Inverse Reinforcement Learning. We first train category-specific reward models via IRL, using a balanced safety dataset covering seven harmful categories as demonstrations. We then enhance Group Relative Policy Optimization (GRPO) with dynamic reward scaling: rewards are adjusted according to task difficulty, measured at the data level by hardness (text-encoder cosine similarity) and at the model level by responsiveness (reward gaps). Extensive experiments across various benchmarks and LLMs demonstrate that DR-IRL outperforms all baseline methods in safety alignment while maintaining usefulness.
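The dynamic reward scaling step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact combination of the two difficulty signals is an assumption here, with `hardness` standing in for a data-level score derived from text-encoder cosine similarity and `reward_gap` for the model-level responsiveness signal; the weighting coefficients `alpha` and `beta` are hypothetical.

```python
import statistics

def group_advantages(rewards):
    """GRPO-style group baseline: normalize each sampled response's
    reward against the mean and std of its group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

def dynamic_scale(advantages, hardness, reward_gap, alpha=0.5, beta=0.5):
    """Hypothetical difficulty weight: harder data (low cosine similarity
    to familiar examples -> hardness near 1) and a less responsive model
    (small reward gap within the group) both increase the weight,
    amplifying the policy update on difficult tasks."""
    weight = 1.0 + alpha * hardness + beta * (1.0 - reward_gap)
    return [a * weight for a in advantages]

# Example: a group of three sampled responses for one prompt.
adv = group_advantages([1.0, 2.0, 3.0])
scaled = dynamic_scale(adv, hardness=0.4, reward_gap=0.2)
```

Scaling is applied after group normalization; a purely multiplicative factor on the raw rewards would cancel out when GRPO normalizes within the group.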
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Safety Evaluation | StrongREJECT | -- | -- | 65 |
| Harmlessness | Stereotype | Refusal Rate | 99.03 | 20 |
| Helpfulness | SimpleQA | Accuracy | 6.64 | 20 |
| Helpfulness | AdvGLUE | Accuracy | 75.15 | 20 |
| Helpfulness | GSM8K | Accuracy | 89.7 | 20 |
| Helpfulness | HHH | Accuracy | 90.71 | 20 |
| Harmlessness | XSTest | Refusal Rate | 99 | 20 |
| Safety | WildChat | Refusal Rate | 74.21 | 20 |
| Jailbreak Resistance | AutoDAN | Refusal Rate | 59 | 3 |
| Jailbreak Resistance | GCG | Refusal Rate | 96.98 | 3 |