
Efficient Anti-exploration via VQVAE and Fuzzy Clustering in Offline Reinforcement Learning

About

Pseudo-count is an effective anti-exploration technique in offline reinforcement learning (RL): it counts state-action pairs and imposes a large penalty on rare or unseen ones. Existing anti-exploration methods count continuous state-action pairs by discretizing them, but the discretization often suffers from the curse of dimensionality and information loss, reducing efficiency and performance and even causing policy learning to fail. In this paper, we propose a novel anti-exploration method for offline RL based on a Vector Quantized Variational Autoencoder (VQVAE) and fuzzy clustering. We first develop an efficient pseudo-count method that discretizes state-action pairs with a multi-codebook VQVAE, and design an offline RL anti-exploration method on top of it to mitigate the curse of dimensionality and improve learning efficiency. In addition, a codebook update mechanism based on fuzzy C-means (FCM) clustering is developed to raise the utilization rate of codebook vectors, addressing the information loss caused by discretization. The proposed method is evaluated on the Datasets for Deep Data-Driven Reinforcement Learning (D4RL) benchmark, and experimental results show that it outperforms state-of-the-art (SOTA) methods on multiple complex tasks at lower computational cost.
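The pseudo-count idea described above can be sketched in a few lines: map each continuous state-action pair to its nearest codebook vector, count visits per code, and penalize rarely visited codes. This is a minimal NumPy sketch of generic vector-quantized pseudo-counting; the codebook size, feature dimension, and 1/sqrt(count) penalty scale are illustrative assumptions, not the paper's actual multi-codebook VQVAE architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative codebook: 16 code vectors for 4-dimensional (state, action)
# features. In the paper these codes would come from a trained VQVAE.
codebook = rng.normal(size=(16, 4))
counts = np.zeros(len(codebook))  # pseudo-count per code


def quantize(x):
    """Index of the nearest codebook vector (Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(codebook - x, axis=1)))


def pseudo_count_penalty(x, beta=1.0):
    """Anti-exploration penalty: large for rarely visited codes,
    shrinking as the corresponding pseudo-count grows."""
    k = quantize(x)
    counts[k] += 1
    return beta / np.sqrt(counts[k])


# Pairs that quantize to the same code share one count, so revisiting
# the same region of state-action space reduces the penalty.
x = rng.normal(size=4)
p1 = pseudo_count_penalty(x)
p2 = pseudo_count_penalty(x)
```

Here the penalty would be subtracted from the critic's target for out-of-distribution actions, discouraging the policy from straying far from the dataset's support.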

Long Chen, Yinkui Liu, Shen Li, Bo Tang, Xuemin Hu • 2026

Related benchmarks

Task                            Dataset                    Metric             Result  Rank
Offline Reinforcement Learning  hopper medium              Normalized Score   97.1    52
Offline Reinforcement Learning  walker2d medium            Normalized Score   89.3    51
Offline Reinforcement Learning  walker2d medium-replay     Normalized Score   96.3    50
Offline Reinforcement Learning  hopper medium-replay       Normalized Score   103.1   44
Offline Reinforcement Learning  halfcheetah medium         Normalized Score   68.2    43
Offline Reinforcement Learning  halfcheetah medium-replay  Normalized Score   62      43
Offline Reinforcement Learning  Maze2D medium              Normalized Return  179.2   38
Offline Reinforcement Learning  Maze2D umaze               Normalized Return  136.9   38
Offline Reinforcement Learning  Walker2d medium-expert     Normalized Score   112.8   31
Offline Reinforcement Learning  Hopper medium-expert       Normalized Score   102.9   24
Showing 10 of 20 rows
