GPG: Generalized Policy Gradient Theorem for Transformer-based Policies

About

We present the Generalized Policy Gradient (GPG) Theorem, specifically designed for Transformer-based policies. Notably, we demonstrate that both the standard Policy Gradient Theorem and GRPO emerge as special cases of our GPG framework. Furthermore, we explore its practical applications in training Large Language Models (LLMs), offering new insights into efficient policy optimization.

Hangyu Mao, Guangting Dong, Zhicheng Dou • 2025
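
For readers unfamiliar with GRPO, which the abstract names as a special case of GPG, the following is a minimal sketch of the group-relative advantage commonly associated with it: each sampled response's reward is normalized against the mean and standard deviation of its own sampling group. This is not the GPG formulation itself, which this page does not spell out. The function name `group_relative_advantages`, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group: (r - mean) / (std + eps).

    Assumption: population standard deviation; implementations differ on
    this choice. `eps` guards against division by zero when every reward
    in the group is identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above the group mean receive positive advantages and those below receive negative ones, so no separate learned value function is needed as a baseline in this sketch.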

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	Accuracy 92.2	1398
Mathematical Reasoning	MATH	Accuracy 88.8	882
Mathematical Reasoning	AIME24	Accuracy 30	160
Knowledge-intensive reasoning	MuSiQue	F1 Score 34.8	43
Knowledge-intensive reasoning	HotpotQA	F1 Score 0.654	41
Knowledge-intensive reasoning	Bamboogle	F1 73.8	23
Knowledge-intensive reasoning	WebWalker	F1 Score 30.5	18
Knowledge-intensive reasoning	2WikiMultihopQA	F1 Score 76.1	18
