
Triplets Better Than Pairs: Towards Stable and Effective Self-Play Fine-Tuning for LLMs

About

Recently, self-play fine-tuning (SPIN) has been proposed to adapt large language models to downstream applications with scarce expert-annotated data by iteratively generating synthetic responses from the model itself. However, SPIN is designed to optimize only the current reward advantages of annotated responses over the synthetic responses at hand, which may gradually vanish over iterations, leading to unstable optimization. Moreover, the use of a reference policy induces a misalignment between the reward formulation used for training and the metric used for generation. To address these limitations, we propose a novel Triplet-based Self-Play fIne-tuNing (T-SPIN) method that integrates two key designs. First, beyond current advantages, T-SPIN additionally incorporates historical advantages between iteratively generated responses and the proto-synthetic responses produced by the initial policy. Even if the current advantages diminish, the historical advantages remain effective, stabilizing the overall optimization. Second, T-SPIN introduces an entropy constraint into the self-play framework, which is theoretically justified to support reference-free fine-tuning, eliminating the training-generation discrepancy. Empirical results on various tasks demonstrate not only the superior performance of T-SPIN over SPIN but also its stable evolution across iterations. Remarkably, compared to supervised fine-tuning, T-SPIN achieves comparable or even better performance with only 25% of the samples, highlighting its effectiveness when annotated data are scarce.
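The abstract describes a triplet objective that pairs the usual current advantage (annotated response over the current model's synthetic response, as in SPIN) with a historical advantage (the current model's response over the proto-synthetic response from the initial policy). The sketch below is a hypothetical illustration of how such a combined loss could be formed from scalar rewards; the function names, the logistic loss, and the weights `beta` and `lam` are assumptions for illustration, not the paper's actual formulation.

```python
import math

def logistic_loss(margin):
    # SPIN/DPO-style logistic loss: small when the reward margin is
    # large and positive, large when the margin is negative.
    return math.log(1.0 + math.exp(-margin))

def t_spin_style_loss(r_annotated, r_current, r_proto, beta=1.0, lam=0.5):
    """Illustrative triplet-style self-play loss (not the paper's exact form).

    r_annotated : scalar reward of the expert-annotated response
    r_current   : scalar reward of the current iterate's synthetic response
    r_proto     : scalar reward of the proto-synthetic response from the
                  initial policy
    beta, lam   : hypothetical scaling and mixing weights
    """
    # Current advantage: annotated response vs. current synthetic response.
    current = logistic_loss(beta * (r_annotated - r_current))
    # Historical advantage: current synthetic response vs. proto-synthetic
    # response; this term stays informative even if the current advantage
    # vanishes as the model approaches the annotated data.
    historical = logistic_loss(beta * (r_current - r_proto))
    return current + lam * historical
```

Note how the historical term keeps supplying gradient signal when `r_annotated - r_current` shrinks to zero, which is the stabilization argument the abstract makes.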

Yibo Wang, Hai-Long Sun, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Lijun Zhang • 2026

Related benchmarks

Task                     Dataset        Metric             Result    Rank
Commonsense Reasoning    HellaSwag      Accuracy           84.6      1460
Commonsense Reasoning    WinoGrande     Accuracy           75.3      776
Instruction Following    IFEval         Accuracy (0-100)   36.9      292
Math                     GSM8K          Accuracy           0.4592    87
Multi-Domain Knowledge   MMLU           Accuracy           58.55     44
Commonsense Reasoning    BBH            Accuracy           45.05     27
Multi-Domain Knowledge   MMLU-Pro       Performance        31.42     24
Multi-Domain Knowledge   GPQA           Performance        31.06     24
Math & Logic             MuSR           Performance        40.2      24
Commonsense Reasoning    BigBenchHard   Accuracy           45.27     18

(Showing 10 of 14 rows.)
