GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA

About

This paper proposes Group-relative Implicit Fine-Tuning (GIFT), a reinforcement learning framework for aligning large language models (LLMs) that unifies on-policy optimization with implicit preference learning. GIFT combines three key elements: (1) group-based sampling and normalization from GRPO, (2) the implicit reward formulation of DPO, and (3) the training principle underlying UNA. The central idea is to transform reward maximization into a group-wise reward matching problem. By jointly normalizing implicit and explicit rewards within each sampled group, GIFT eliminates the intractable normalization constant associated with implicit rewards and reduces sensitivity to the KL-regularization coefficient. This yields a simple mean squared error (MSE) objective between the normalized implicit and explicit rewards, providing a stable and analytically tractable training signal. Unlike offline approaches such as DPO and UNA, GIFT samples responses from the current policy and so retains on-policy exploration. Compared to GRPO, it replaces high-variance reward maximization with structured reward matching, simplifying optimization and reducing sensitivity to hyperparameters. GIFT is evaluated in both RLHF and RLVR settings on models ranging from 7B to 32B parameters. Results show that GIFT converges faster, generalizes better with reduced overfitting, and outperforms GRPO on mathematical reasoning benchmarks (GSM8K, MATH, AIME) as well as on generation benchmarks (AlpacaEval and Arena-Hard).
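The reward-matching idea described in the abstract can be illustrated with a small sketch. It assumes the standard DPO-style implicit reward, beta * (log pi_theta(y|x) - log pi_ref(y|x)) plus a per-prompt constant, and standardizes both reward vectors to zero mean and unit variance within one sampled group before taking the MSE. The function names (`normalize`, `group_matching_loss`), the `eps` term, and the choice of standardization are illustrative assumptions, not the paper's reference implementation; the exact loss in the paper may differ.

```python
import math


def normalize(values, eps=1e-8):
    """Standardize one group of scalars to zero mean and (near) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    return [(v - mean) / (std + eps) for v in values]


def group_matching_loss(policy_logps, ref_logps, explicit_rewards, beta=1.0):
    """MSE between group-normalized implicit and explicit rewards.

    policy_logps / ref_logps: sequence log-probabilities of each sampled
    response under the current policy and the frozen reference model.
    explicit_rewards: reward-model or verifier scores for the same responses.
    All three lists describe the same group of responses to one prompt.
    """
    # DPO-style implicit reward, up to a per-prompt additive constant.
    implicit = [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]
    # Mean subtraction removes any per-prompt constant; dividing by the
    # group standard deviation removes the scale factor beta.
    norm_implicit = normalize(implicit)
    norm_explicit = normalize(explicit_rewards)
    n = len(implicit)
    return sum((a - b) ** 2 for a, b in zip(norm_implicit, norm_explicit)) / n


# Hypothetical group of four responses to a single prompt.
policy_logps = [-12.0, -15.5, -11.0, -14.0]
ref_logps = [-13.0, -14.0, -12.5, -13.0]
rewards = [1.0, 0.0, 1.0, 0.0]
loss = group_matching_loss(policy_logps, ref_logps, rewards, beta=0.1)
```

In this sketch, shifting every implicit reward in the group by the same constant, or rescaling `beta`, leaves the loss essentially unchanged. That mirrors the abstract's claim that joint normalization removes the intractable normalization constant and reduces sensitivity to the KL-regularization coefficient. In actual training, `policy_logps` would carry gradients and the loss would be minimized with an autodiff framework.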

Zhichao Wang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Instruction Following | AlpacaEval | Win Rate 78.53 | 227 |
| Bias Evaluation | BBQ | Accuracy 71.06 | 113 |
| Truthfulness Evaluation | TruthfulQA | -- | 33 |
| Multistep Reasoning | MuSR | Accuracy 49.47 | 31 |
| Science Question Answering | ARC Challenge | Score 73.89 | 12 |
| Coreference Resolution | Winogender | Accuracy 64.72 | 9 |
| Question Answering | GPQA | GPQA Score 37.25 | 6 |