Can We Really Learn One Representation to Optimize All Rewards?

About

As machine learning has moved towards leveraging large models as priors for downstream tasks, the community has debated the right form of prior for solving reinforcement learning (RL) problems. If one were to try to prefetch as much computation as possible, they would attempt to learn a prior over the policies for some yet-to-be-determined reward function. Recent work (forward-backward (FB) representation learning) has tried this, arguing that an unsupervised representation learning procedure can enable optimal control over arbitrary rewards without further fine-tuning. However, FB's training objective and learning behavior remain mysterious. In this paper, we demystify FB by clarifying when such representations can exist, what its objective optimizes, and how it converges in practice. We draw connections with rank matching, fitted Q-evaluation, and contraction mapping. Our analysis suggests a simplified unsupervised pre-training method for RL that, instead of enabling optimal control, performs one step of policy improvement. We call our proposed method one-step forward-backward representation learning (one-step FB). Experiments in didactic settings, as well as in 10 state-based and image-based continuous control domains, demonstrate that one-step FB converges to errors $10^5\times$ smaller and improves zero-shot performance by $+24\%$ on average. Our project website is available at https://chongyi-zheng.github.io/onestep-fb.
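
To make the phrase "one step of policy improvement" concrete, the following is a minimal tabular sketch, not the paper's training procedure. It assumes the factorization commonly used in the FB literature, $M(s, a, s') = F(s, a)^\top B(s')\,\rho(s')$, applied here to the successor measure of a fixed behavior policy, and it obtains $F$ and $B$ from an exact SVD rather than from a learned objective. The function names, the random MDP, the uniform behavior policy, and the uniform state distribution `rho` are all hypothetical choices made for illustration.

```python
import numpy as np


def successor_measure(P, pi, gamma):
    """Discounted successor measure of a fixed policy.

    P:  (S, A, S) transition probabilities P[s, a, s'].
    pi: (S, A) action probabilities of the policy.
    Returns M of shape (S*A, S) with
    M[(s, a), s'] = sum_{t>=0} gamma^t * Pr(s_{t+1} = s' | s, a, pi).
    """
    S, A, _ = P.shape
    P_sa = P.reshape(S * A, S)
    # State-to-state transition matrix under the policy.
    P_pi = np.einsum("sa,sat->st", pi, P)
    return P_sa @ np.linalg.inv(np.eye(S) - gamma * P_pi)


def factorize(M, rho):
    """Exact full-rank factorization M[(s,a), s'] = F[(s,a)] . B[s'] * rho[s'].

    Assumption: an SVD stands in for representation learning.
    """
    U, sing, Vt = np.linalg.svd(M / rho[None, :], full_matrices=False)
    return U * sing[None, :], Vt.T


def zero_shot_q(F, B, rho, reward):
    """Infer a task vector z = E_rho[B(s) r(s)] and return Q = F z."""
    z = B.T @ (rho * reward)
    return F @ z


# Hypothetical random MDP with a uniform behavior policy.
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # shape (S, A, S)
behavior = np.full((S, A), 1.0 / A)
rho = np.full(S, 1.0 / S)

M = successor_measure(P, behavior, gamma)
F, B = factorize(M, rho)

# A reward revealed only at test time; no further training is needed.
reward = rng.normal(size=S)
q_behavior = zero_shot_q(F, B, rho, reward).reshape(S, A)

# One step of policy improvement: act greedily on the behavior policy's Q.
improved_policy = q_behavior.argmax(axis=1)
```

In this sketch the representations are computed once, before any reward is known, and a new reward only changes the vector `z`. The resulting greedy policy improves on the behavior policy by one step but is not claimed to be optimal.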

Chongyi Zheng, Royina Karegoudra Jayanth, Benjamin Eysenbach • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Goal Reaching | AntMaze Medium navigate-v0 (test-time goals) | Success Rate | 93 | 10 |
| Goal Reaching | AntMaze Large test-time goals navigate-v0 | Success Rate | 73 | 10 |
| Goal Reaching | AntMaze Giant navigate test-time goals v0 | Average Success Rate | 5 | 10 |
| Goal Reaching | AntMaze Teleport navigate-v0 (test-time goals) | Average Success Rate | 42 | 10 |
| Goal-conditioned Reinforcement Learning | OGBench scene play (5 tasks) zero-shot | Average Return | 16 | 10 |
| Goal Reaching | antmaze medium-navigate v0 | Success Rate | 93 | 8 |
| Goal Reaching | antmaze large-navigate v0 | Success Rate | 73 | 8 |
| Goal Reaching | antmaze giant-navigate v0 | Success Rate | 5 | 8 |
| Goal Reaching | antmaze teleport-navigate v0 | Success Rate | 42 | 8 |
| Unsupervised Reinforcement Learning | ExORL cheetah (4 tasks) zero-shot | Average Return | 378 | 6 |
Showing 10 of 19 rows
