
Generative Actor Critic

About

Conventional Reinforcement Learning (RL) algorithms, typically focused on estimating or maximizing expected returns, face challenges when refining offline pretrained models with online experiences. This paper introduces Generative Actor Critic (GAC), a novel framework that decouples sequential decision-making by reframing policy evaluation as learning a generative model of the joint distribution over trajectories and returns, $p(\tau, y)$, and policy improvement as performing versatile inference on this learned model. To operationalize GAC, we introduce a specific instantiation based on a latent variable model that features continuous latent plan vectors. We develop novel inference strategies for both exploitation, by optimizing latent plans to maximize expected returns, and exploration, by sampling latent plans conditioned on dynamically adjusted target returns. Experiments on Gym-MuJoCo and Maze2D benchmarks demonstrate GAC's strong offline performance and significantly enhanced offline-to-online improvement compared to state-of-the-art methods, even in the absence of step-wise rewards.
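The two inference modes named in the abstract can be illustrated on a toy model. This sketch is not GAC's model. It assumes a hypothetical linear-Gaussian latent plan model with prior $z \sim \mathcal{N}(0, I)$ and return head $y \mid z \sim \mathcal{N}(w^\top z + b, \sigma^2)$, and it omits the trajectory decoder $p(\tau \mid z)$ entirely. Exploitation is shown as gradient ascent on $z$ against the expected return plus a Gaussian log-prior penalty (an assumed objective). Exploration is shown as an exact draw from $p(z \mid y = y^*)$ via Matheron's rule, which applies here only because $z$ and $y$ are jointly Gaussian. The target-return schedule `next_target` is likewise a hypothetical stand-in for the paper's dynamic adjustment.

```python
import random

# Hypothetical toy latent plan model (NOT the paper's architecture):
#   z ~ N(0, I_d)                    prior over continuous latent plans
#   y | z ~ N(w . z + b, sigma^2)    linear-Gaussian return head
# The trajectory decoder p(tau | z) is omitted.


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def exploit(w, lam=1.0, lr=0.1, steps=500):
    """Exploitation sketch: gradient ascent on the latent plan z.

    Assumed objective J(z) = E[y | z] + lam * log p(z) (up to a constant)
                           = w . z + b - (lam / 2) * ||z||^2,
    whose gradient is w - lam * z. The maximizer is z* = w / lam;
    the constant b does not affect it.
    """
    z = [0.0] * len(w)
    for _ in range(steps):
        z = [zi + lr * (wi - lam * zi) for zi, wi in zip(z, w)]
    return z


def explore(w, b, sigma, y_target, rng):
    """Exploration sketch: exact sample from p(z | y = y_target).

    Matheron's rule for jointly Gaussian (z, y): draw (z0, y0) from the
    joint, then shift z0 by Cov(z, y) / Var(y) * (y_target - y0).
    Here Cov(z, y) = w and Var(y) = ||w||^2 + sigma^2.
    """
    z0 = [rng.gauss(0.0, 1.0) for _ in w]
    y0 = dot(w, z0) + b + rng.gauss(0.0, sigma)
    var_y = dot(w, w) + sigma ** 2
    return [zi + wi * (y_target - y0) / var_y for zi, wi in zip(z0, w)]


def next_target(returns_seen, margin=0.1):
    """Hypothetical target schedule: aim a fraction above the best return seen."""
    best = max(returns_seen)
    return best + margin * abs(best)


# Usage on made-up parameters.
rng = random.Random(0)
w, b, sigma = [1.0, -2.0, 0.5], 0.0, 0.1
z_star = exploit(w)                        # approximately w / lam
target = next_target([3.0, 4.5, 5.0])      # 5.5
z_explore = explore(w, b, sigma, target, rng)
print(z_star, target, dot(w, z_explore) + b)
```

The prior penalty in `exploit` matters: without it the linear objective $w^\top z$ is unbounded, so the optimized plan would drift arbitrarily far from the region the prior covers. In `explore`, the sampled plan's expected return lands near the requested target, with spread controlled by `sigma`.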

Aoyang Qin, Deqian Kong, Wei Wang, Ying Nian Wu, Song-Chun Zhu, Sirui Xie• 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Offline Reinforcement Learning | D4RL Walker2d Medium v2 | Normalized Return | 80.2 | 67
Offline Reinforcement Learning | D4RL Hopper-medium-replay v2 | Normalized Return | 80.1 | 54
Offline Reinforcement Learning | D4RL Hopper Medium v2 | Normalized Return | 73.2 | 43
Offline Reinforcement Learning | D4RL HalfCheetah Medium v2 | Average Normalized Return | 43.6 | 43
Offline Reinforcement Learning | D4RL HalfCheetah Med-Replay v2 | Average Normalized Return | 39.8 | 29
Offline Reinforcement Learning | D4RL Walker-medium-replay v2 | Normalized Return | 78.9 | 25
Online Fine-tuning | D4RL MuJoCo and Maze2D online fine-tuning v2 v0 | Normalized Return | 93.3 | 14
Offline Reinforcement Learning | D4RL Maze2D Umaze v1 | Average Normalized Return | 67.8 | 9
Offline Reinforcement Learning | D4RL Maze2D Medium v1 | Average Normalized Return | 74.5 | 9
Offline Reinforcement Learning | D4RL Maze2D Large v1 | Average Normalized Return | 50.3 | 9
Showing 10 of 12 rows
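The Normalized Return values above presumably follow the standard D4RL convention, which rescales a raw episode return so that 0 matches a random-policy reference score and 100 matches an expert reference score for that environment (D4RL's own helper returns the unscaled fraction; leaderboards typically report it multiplied by 100). A minimal sketch of that rescaling; the reference scores below are made-up placeholders, not D4RL's published values:

```python
def d4rl_normalized_return(raw_return, random_score, expert_score):
    """Rescale a raw return: 0 = random-policy reference, 100 = expert reference."""
    return 100.0 * (raw_return - random_score) / (expert_score - random_score)


# Placeholder reference scores (hypothetical, for illustration only).
print(d4rl_normalized_return(802.0, random_score=0.0, expert_score=1000.0))  # 80.2
print(d4rl_normalized_return(0.0, random_score=0.0, expert_score=1000.0))    # 0.0
```

Because the scale is anchored to reference policies rather than bounded, values above 100 are possible when a policy outperforms the expert reference.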
