GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA

About

This paper investigates whether reward matching is a viable alternative to reward maximization for on-policy RL of LLMs. It proposes Group-Relative Implicit Fine-Tuning (GIFT), which combines GRPO-style group sampling, the DPO-style implicit reward, and a UNA-style MSE loss between implicit and explicit advantages. Applying z-score standardization within each group cancels the intractable partition function $Z(x)$ in the DPO implicit reward and eliminates the KL coefficient $\beta$ from the RLHF and RLVR objectives. The population minimizers of $\mathcal{L}_{\text{GIFT}}$ are characterized in closed form: they coincide exactly with the GRPO/RLHF solution family $\pi^{*}_{\beta}(y|x)\propto\pi_{\text{ref}}(y|x)e^{\frac{1}{\beta}r_{\phi}(x,y)}$, with a prompt-dependent, variance-determined KL coefficient $\beta(x)=\frac{\sigma_\phi(x)}{\hat{\sigma}_\theta(x)}$. GIFT therefore targets the same parametric policy family as GRPO, but replaces GRPO's externally tuned scalar $\beta$ with a prompt-adaptive $\beta(x)$ that is set endogenously by matching reward distributions. Empirically, on 7B-32B backbones, GIFT converges faster than GRPO, DAPO, and GSPO; it overfits less on RLVR benchmarks (GSM8K, MATH, AIME) and achieves higher length-controlled win rates on RLHF benchmarks (AlpacaEval, Arena-Hard). All proofs and detailed background are deferred to the appendix.
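
The reward-matching idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes sequence-level log-probabilities, a population standard deviation with a small `eps`, and the standard DPO form of the implicit reward, $\beta\log\frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}+\beta\log Z(x)$. The function names `zscore`, `dpo_implicit_reward`, and `gift_loss` are hypothetical and chosen only for this example. The sketch shows why $\beta$ and $\log Z(x)$ drop out after standardization, and how the loss compares standardized implicit and explicit rewards within one prompt's group.

```python
import math

def zscore(values, eps=1e-8):
    """Standardize a list to zero mean and (approximately) unit standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / (std + eps) for v in values]

def dpo_implicit_reward(logp_policy, logp_ref, beta, log_z):
    """DPO-style implicit reward: beta * log(pi/pi_ref) + beta * log Z(x).
    log_z is normally intractable; it is passed here only to show it cancels."""
    return [beta * (lp - lr) + beta * log_z for lp, lr in zip(logp_policy, logp_ref)]

def gift_loss(logp_policy, logp_ref, rewards):
    """MSE between standardized implicit and explicit rewards for one group.
    Assumption: beta and log Z are omitted because z-scoring removes any
    positive scale and any constant shift shared by the group."""
    implicit = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    a_implicit = zscore(implicit)
    a_explicit = zscore(rewards)
    n = len(rewards)
    return sum((ai - ae) ** 2 for ai, ae in zip(a_implicit, a_explicit)) / n

# Illustrative (made-up) numbers for one prompt with a group of four samples.
logp_policy = [-12.0, -15.5, -9.8, -20.1]
logp_ref = [-13.0, -15.0, -11.0, -19.0]
rewards = [1.0, 0.0, 1.0, 0.0]

# Standardized implicit rewards do not depend on beta or log Z(x).
plain = zscore([lp - lr for lp, lr in zip(logp_policy, logp_ref)])
shifted = zscore(dpo_implicit_reward(logp_policy, logp_ref, beta=0.3, log_z=7.0))
max_gap = max(abs(a - b) for a, b in zip(plain, shifted))

loss = gift_loss(logp_policy, logp_ref, rewards)
print(max_gap, loss)
```

In this sketch the loss reaches zero whenever the log-ratio $\log\frac{\pi_\theta}{\pi_{\text{ref}}}$ is a positive affine function of the reward within the group, which is consistent with the exponential-tilting solution family stated in the abstract.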

Zhichao Wang • 2025

Related benchmarks

Task	Dataset	Result
Instruction Following	AlpacaEval	Win Rate 78.53	420
Bias Evaluation	BBQ	Accuracy 71.06	171
Truthfulness Evaluation	TruthfulQA	--	33
Multistep Reasoning	MuSR	Accuracy 49.47	31
Science Question Answering	ARC Challenge	Score 73.89	17
Coreference Resolution	Winogender	Accuracy 64.72	9
Question Answering	GPQA	GPQA Score 37.25	6
