
Wasserstein Barycenter Soft Actor-Critic

About

Deep off-policy actor-critic algorithms have emerged as the leading framework for reinforcement learning in continuous control domains. However, most of these algorithms suffer from poor sample efficiency, especially in environments with sparse rewards. In this paper, we take a step towards addressing this issue by providing a principled directed exploration strategy. We propose the Wasserstein Barycenter Soft Actor-Critic (WBSAC) algorithm, which pairs a pessimistic actor for temporal-difference learning with an optimistic actor that promotes exploration. This is achieved by using the Wasserstein barycenter of the pessimistic and optimistic policies as the exploration policy and adjusting the degree of exploration throughout the learning process. We compare WBSAC with state-of-the-art off-policy actor-critic algorithms and show that it is more sample-efficient on MuJoCo continuous control tasks.
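The abstract does not spell out how the barycenter is computed, but for the diagonal Gaussian policies that SAC-style actors typically use, the Wasserstein-2 barycenter has a closed form: the means and standard deviations interpolate linearly. The sketch below illustrates that construction together with one plausible way to anneal the exploration weight; the function names, the initial weight of 0.5, and the linear decay schedule are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def w2_barycenter_gaussian(mu_p, std_p, mu_o, std_o, lam):
    """Wasserstein-2 barycenter of two diagonal Gaussian policies.

    For one-dimensional (or diagonal) Gaussians, the W2 barycenter with
    weights (1 - lam, lam) is again Gaussian, with the mean and standard
    deviation interpolated linearly between the two policies.
    """
    mu = (1.0 - lam) * mu_p + lam * mu_o
    std = (1.0 - lam) * std_p + lam * std_o
    return mu, std

def exploration_weight(step, total_steps, lam_init=0.5):
    """Hypothetical schedule: start partly optimistic, decay linearly
    toward the pessimistic policy as training progresses."""
    return lam_init * max(0.0, 1.0 - step / total_steps)

# Example: pessimistic and optimistic Gaussian action heads for one state.
mu_p, std_p = np.array([0.1]), np.array([0.2])   # pessimistic actor
mu_o, std_o = np.array([0.6]), np.array([0.5])   # optimistic actor

lam = exploration_weight(step=10_000, total_steps=100_000)
mu_b, std_b = w2_barycenter_gaussian(mu_p, std_p, mu_o, std_o, lam)
action = np.random.normal(mu_b, std_b)           # sample from the barycenter policy
```

Sampling from the barycenter rather than mixing the two policies keeps the exploration policy unimodal, which is one motivation for using a Wasserstein barycenter instead of a mixture distribution.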

Zahra Shahrooei, Ali Baheri · 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Ball In Cup Catch | DeepMind Control Suite | Average Return | 983.9 | 4
Cheetah Run | DeepMind Control Suite | Average Return | 740.2 | 4
Continuous Control | MuJoCo HalfCheetah-v5 (test) | Average Return | 6.47e+3 | 4
Continuous Control | MuJoCo Walker2d-v5 (test) | Average Return | 4.42e+3 | 4
Continuous Control | MuJoCo Humanoid-v5 (test) | Average Return | 5.70e+3 | 4
Finger Turn Hard | DeepMind Control Suite | Average Return | 911 | 4
Finger Turn Easy | DeepMind Control Suite | Average Return | 944 | 4
Hopper Hop | DeepMind Control Suite | Average Return | 130.1 | 4
Humanoid Run | DeepMind Control Suite | Average Return | 144.1 | 4
Walker Run | DeepMind Control Suite | Average Return | 684.3 | 4

Showing 10 of 12 rows.
