Latent Adversarial Regularization for Offline Preference Optimization
About
Learning from human feedback typically relies on preference optimization that constrains policy updates through token-level regularization. However, preference optimization for language models is particularly challenging because token-space similarity does not imply semantic or behavioral similarity. To address this challenge, we leverage latent-space regularization for language model preference optimization. We introduce GANPO, which achieves latent-space regularization by penalizing divergence between the internal representations of a policy model and a reference model. Given that latent representations are not associated with explicit probability densities, we adopt an adversarial approach inspired by GANs to minimize latent-space divergence. We integrate GANPO as a regularizer into existing offline preference optimization objectives. Experiments across multiple model architectures and tasks show consistent improvements from latent-space regularization. Further, by comparing the inductive biases induced by GANPO with those of token-level regularization, we find that GANPO provides more robust structural feedback under distributional shift and noise, while maintaining comparable downstream performance at only minor computational overhead.
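The adversarial regularizer described above can be illustrated with a minimal sketch. This is not the paper's implementation: the latent dimensions, the toy Gaussian "hidden states", the linear discriminator, and all hyperparameters below are illustrative assumptions. The discriminator is trained on the standard GAN objective to distinguish reference-model latents from policy latents, and the policy's regularization term is the non-saturating generator loss on its own latents.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical stand-ins for hidden states drawn from the reference model
# ("real") and the policy model ("fake"); in GANPO these would come from
# the models' internal representations, not from Gaussians.
d = 16
ref_latents = rng.normal(0.0, 1.0, size=(32, d))
pol_latents = rng.normal(0.5, 1.0, size=(32, d))

# A linear discriminator D(h) = sigmoid(h @ w + b), trained by gradient
# ascent on the GAN objective: E[log D(ref)] + E[log(1 - D(pol))].
w = np.zeros(d)
b = 0.0
lr = 0.1
for _ in range(100):
    p_ref = sigmoid(ref_latents @ w + b)
    p_pol = sigmoid(pol_latents @ w + b)
    # d/dz log sigmoid(z) = 1 - sigmoid(z);  d/dz log(1 - sigmoid(z)) = -sigmoid(z)
    grad_w = ref_latents.T @ (1 - p_ref) / len(p_ref) - pol_latents.T @ p_pol / len(p_pol)
    grad_b = np.mean(1 - p_ref) - np.mean(p_pol)
    w += lr * grad_w
    b += lr * grad_b

# Latent-space penalty for the policy: the non-saturating generator loss
# -E[log D(pol)], which is large when the discriminator can tell the
# policy's latents apart from the reference model's.
ganpo_reg = -np.mean(np.log(sigmoid(pol_latents @ w + b) + 1e-8))
print(ganpo_reg)
```

In training, a term like `ganpo_reg` would be scaled by a coefficient and added to an offline preference objective (e.g. DPO), with the discriminator updated alongside the policy.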
Related benchmarks
| Task | Dataset | Metric | Result (%) | Rank |
|---|---|---|---|---|
| Math | GSM8K | Accuracy | 48.67 | 87 |
| Knowledge | MMLU | Accuracy | 56.93 | 71 |
| Factuality | TruthfulQA | Accuracy | 55.67 | 18 |
| Preference Alignment | AlpacaEval 2.0 (weighted, GPT-4 Turbo) | Win Rate | 46.11 | 8 |
| Reasoning | ANLI R3 | Accuracy | 48.25 | 3 |