GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
About
This paper investigates whether reward matching is a viable alternative to reward maximization for on-policy RL of LLMs. Group-Relative Implicit Fine-Tuning (GIFT) is proposed, combining GRPO-style group sampling, a DPO-style implicit reward, and a UNA-style mean squared error (MSE) loss between implicit and explicit advantages. Applying z-score standardization within each group cancels the intractable partition function $Z(x)$ in the DPO implicit reward and eliminates the KL coefficient $\beta$ from the RLHF and RLVR objectives. The population minimizers of $\mathcal{L}_{\text{GIFT}}$ are characterized in closed form: they coincide exactly with the GRPO/RLHF solution family $\pi^{*}_{\beta}(y|x)\propto\pi_{\text{ref}}(y|x)e^{\frac{1}{\beta}r_{\phi}(x,y)}$, with a prompt-dependent, variance-determined KL coefficient $\beta(x)=\frac{\sigma_\phi(x)}{\hat{\sigma}_\theta(x)}$. GIFT therefore targets the same parametric policy family as GRPO while replacing GRPO's externally tuned scalar $\beta$ with a prompt-adaptive $\beta(x)$ that is determined endogenously by matching reward distributions. Empirically, on 7B-32B backbones, GIFT converges faster than GRPO, DAPO, and GSPO, overfits less on RLVR benchmarks (GSM8K, MATH, AIME), and produces higher length-controlled win rates on RLHF benchmarks (AlpacaEval, Arena-Hard). All proofs and detailed background are deferred to the appendix.
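The loss described above can be illustrated with a minimal NumPy sketch for a single prompt. It is not the authors' implementation: the function names `zscore` and `gift_loss_sketch`, the `eps` stabilizer, and the use of sequence-level log-probabilities are assumptions made for illustration. The sketch relies on the DPO implicit reward $\beta\log\frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}+\beta\log Z(x)$: within one group of responses to the same prompt, subtracting the group mean removes the constant $\beta\log Z(x)$ and dividing by the group standard deviation removes the scale $\beta$, so only the log-ratio is needed.

```python
import numpy as np


def zscore(values, eps=1e-8):
    """Standardize a group of values to zero mean and (roughly) unit std."""
    values = np.asarray(values, dtype=np.float64)
    # eps is an assumed stabilizer for groups whose values are all equal.
    return (values - values.mean()) / (values.std() + eps)


def gift_loss_sketch(logp_policy, logp_ref, rewards, eps=1e-8):
    """Hypothetical per-prompt GIFT-style loss (illustrative only).

    logp_policy, logp_ref: log-probabilities of each sampled response under
        the current policy and the reference policy (one entry per response).
    rewards: explicit rewards for the same responses.
    """
    # Implicit reward up to the factor beta and the constant beta * log Z(x);
    # both drop out after z-score standardization within the group.
    log_ratio = np.asarray(logp_policy, dtype=np.float64) - np.asarray(
        logp_ref, dtype=np.float64
    )
    implicit_adv = zscore(log_ratio, eps)
    explicit_adv = zscore(rewards, eps)
    # MSE between implicit and explicit group-relative advantages.
    return float(np.mean((implicit_adv - explicit_adv) ** 2))


# Usage: one prompt with a group of four sampled responses (made-up numbers).
rewards = np.array([1.0, 0.0, 0.5, 2.0])
logp_ref = np.array([-3.0, -2.5, -4.0, -3.5])
logp_policy = logp_ref + rewards / 2.0  # log-ratio is affine in the reward
loss = gift_loss_sketch(logp_policy, logp_ref, rewards)
```

In this toy group the log-ratio equals the reward divided by 2, so the two standardized advantage vectors match and the loss is numerically zero. Reading $\hat{\sigma}_\theta(x)$ as the group standard deviation of the log-ratio (an assumption, since the abstract does not define it), this corresponds to $\beta(x)=\sigma_\phi(x)/\hat{\sigma}_\theta(x)=2$.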
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Instruction Following | AlpacaEval | Win Rate: 78.53 | 420 |
| Bias Evaluation | BBQ | Accuracy: 71.06 | 171 |
| Truthfulness Evaluation | TruthfulQA | -- | 33 |
| Multistep Reasoning | MuSR | Accuracy: 49.47 | 31 |
| Science Question Answering | ARC Challenge | Score: 73.89 | 17 |
| Coreference Resolution | Winogender | Accuracy: 64.72 | 9 |
| Question Answering | GPQA | GPQA Score: 37.25 | 6 |