
GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA

About

This paper investigates whether reward matching is a viable alternative to reward maximization for on-policy RL of LLMs. It proposes Group-relative Implicit Fine-Tuning (GIFT), which combines GRPO-style group sampling, a DPO-style implicit reward, and a UNA-style MSE loss between implicit and explicit advantages. Applying z-score standardization cancels the intractable partition function $Z(x)$ in the DPO implicit reward and eliminates the KL coefficient $\beta$ from the RLHF and RLVR objectives. The population minimizers of $\mathcal{L}_{\text{GIFT}}$ are characterized in closed form: they coincide exactly with the GRPO/RLHF solution family $\pi^{*}_{\beta}(y|x)\propto\pi_{\text{ref}}(y|x)e^{\frac{1}{\beta}r_{\phi}(x,y)}$, with a prompt-dependent, variance-determined KL coefficient $\beta(x)=\frac{\sigma_\phi(x)}{\hat{\sigma}_\theta(x)}$. GIFT therefore targets the same parametric policy family as GRPO while replacing GRPO's externally tuned scalar $\beta$ with a prompt-adaptive $\beta(x)$ that is determined endogenously by matching reward distributions. Empirically, on 7B-32B backbones, GIFT converges faster than GRPO, DAPO, and GSPO. It overfits less on RLVR tasks (GSM8K, MATH, AIME) and achieves higher length-controlled win rates on RLHF benchmarks (AlpacaEval, Arena-Hard). All proofs and detailed background are deferred to the appendix.
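The cancellation of $Z(x)$ and $\beta$ under z-score standardization can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the implicit reward takes the usual DPO form $\beta\log\frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}+\beta\log Z(x)$, that standardization uses the group mean and population standard deviation, and that the loss is a plain mean squared error over one prompt's group. The function names, the `eps` stabilizer, and the sample numbers are all hypothetical.

```python
import math
import statistics


def zscore(values, eps=1e-8):
    """Standardize one group of values to zero mean and unit variance.

    eps is a hypothetical stabilizer that avoids division by zero.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / (std + eps) for v in values]


def implicit_rewards(logp_policy, logp_ref, beta=1.0, log_z=0.0):
    """Assumed DPO-style implicit reward: beta * (log pi_theta - log pi_ref) + beta * log Z(x)."""
    return [beta * (lp - lr) + beta * log_z for lp, lr in zip(logp_policy, logp_ref)]


def gift_loss(logp_policy, logp_ref, explicit_rewards):
    """MSE between standardized implicit and explicit rewards for one prompt's group."""
    a_implicit = zscore(implicit_rewards(logp_policy, logp_ref))
    a_explicit = zscore(explicit_rewards)
    return statistics.fmean([(i - e) ** 2 for i, e in zip(a_implicit, a_explicit)])


# Hypothetical sequence log-probabilities for a group of four sampled responses.
logp_policy = [-12.0, -15.5, -9.8, -20.1]
logp_ref = [-13.0, -14.0, -11.0, -19.0]
rewards = [1.0, 0.0, 1.0, 0.0]  # e.g. verifiable 0/1 rewards

# The per-prompt constant beta * log Z(x) is removed by mean subtraction,
# and the positive scale beta is removed by dividing by the standard deviation.
z_a = zscore(implicit_rewards(logp_policy, logp_ref, beta=1.0, log_z=0.0))
z_b = zscore(implicit_rewards(logp_policy, logp_ref, beta=0.1, log_z=5.0))
same = all(math.isclose(a, b, abs_tol=1e-6) for a, b in zip(z_a, z_b))

loss = gift_loss(logp_policy, logp_ref, rewards)
```

In this sketch, `same` evaluates to true for any positive `beta` and any `log_z`, which is why neither quantity needs to be known or tuned once both sides are standardized within the group.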

Zhichao Wang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Instruction Following | AlpacaEval | Win Rate 78.53 | 420 |
| Bias Evaluation | BBQ | Accuracy 71.06 | 171 |
| Truthfulness Evaluation | TruthfulQA | -- | 33 |
| Multistep Reasoning | MuSR | Accuracy 49.47 | 31 |
| Science Question Answering | ARC Challenge | Score 73.89 | 17 |
| Coreference Resolution | Winogender | Accuracy 64.72 | 9 |
| Question Answering | GPQA | GPQA Score 37.25 | 6 |
