GPG: Generalized Policy Gradient Theorem for Transformer-based Policies

About

We present the Generalized Policy Gradient (GPG) Theorem, designed specifically for Transformer-based policies. We show that both the standard Policy Gradient Theorem and GRPO emerge as special cases of the GPG framework. We also explore its practical application to training Large Language Models (LLMs), offering new insights into efficient policy optimization.
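
For orientation, the two special cases named above are commonly written as follows. This is standard background notation only, not the paper's statement of the GPG theorem; the symbols ($\pi_\theta$ for the policy, $Q^{\pi_\theta}$ for the action-value function, $r_i$ for the reward of the $i$-th of $G$ sampled responses) are assumptions introduced here for illustration.

```latex
% Standard policy gradient theorem (common textbook form)
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)
    \right]

% GRPO group-relative advantage for response i in a group of G samples
\hat{A}_i
  = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}
         {\operatorname{std}(r_1, \dots, r_G)}
```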

Hangyu Mao, Guangting Dong, Zhicheng Dou • 2025

Related benchmarks

Task                           Dataset          Metric    Result  Rank
Mathematical Reasoning         GSM8K            Accuracy  92.2    983
Mathematical Reasoning         MATH             Accuracy  88.8    643
Mathematical Reasoning         AIME24           Accuracy  30      130
Knowledge-intensive reasoning  WebWalker        F1 Score  30.5    18
Knowledge-intensive reasoning  HotpotQA         F1 Score  0.654   18
Knowledge-intensive reasoning  Bamboogle        F1        73.8    18
Knowledge-intensive reasoning  MuSiQue          F1 Score  34.8    18
Knowledge-intensive reasoning  2WikiMultihopQA  F1 Score  76.1    18