
COVR: Collaborative Optimization of VLMs and RL Agent for Visual-Based Control

About

Visual reinforcement learning (RL) suffers from poor sample efficiency due to high-dimensional observations in complex tasks. While existing works have shown that vision-language models (VLMs) can assist RL, they often focus on knowledge distillation from the VLM to RL, overlooking the potential of RL-generated interaction data to enhance the VLM. To address this, we propose COVR, a collaborative optimization framework that enables mutual enhancement of the VLM and RL policies. Specifically, COVR fine-tunes the VLM with RL-generated data to strengthen semantic reasoning consistent with the target task, and uses the enhanced VLM to further guide policy learning via action priors. To improve fine-tuning efficiency, we introduce two key modules: (1) an Exploration-Driven Dynamic Filter module that preserves valuable exploration samples using adaptive thresholds based on the degree of exploration, and (2) a Return-Aware Adaptive Loss Weight module that improves training stability by quantifying the inconsistency of sampled actions via RL return signals. We further design a progressive fine-tuning strategy to reduce resource consumption. Extensive experiments show that COVR achieves strong performance across various challenging visual control tasks.
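The two modules described above can be sketched roughly as follows. This is an illustrative sketch only: the exploration score, threshold schedule, and weighting formula below are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def dynamic_filter(episodes, base_threshold=0.5):
    """Hypothetical Exploration-Driven Dynamic Filter: keep samples whose
    exploration score exceeds a threshold adapted to the batch's own
    exploration level. Using action entropy as the "degree of exploration"
    is an assumption; the paper's scoring rule may differ."""
    mean_entropy = np.mean([ep["action_entropy"] for ep in episodes])
    threshold = base_threshold * mean_entropy  # adaptive, not fixed
    return [ep for ep in episodes if ep["action_entropy"] >= threshold]

def return_aware_weight(episode_return, return_mean, return_std, eps=1e-8):
    """Hypothetical Return-Aware Adaptive Loss Weight: down-weight VLM
    fine-tuning samples whose RL return falls far below the running
    average, treating low return as a sign of inconsistent actions."""
    z = (episode_return - return_mean) / (return_std + eps)
    # Squash to (0, 1): above-average returns get weights near 1.
    return 1.0 / (1.0 + np.exp(-z))
```

In this sketch, the fine-tuning loss for each sample would simply be multiplied by `return_aware_weight(...)`, so clearly sub-par trajectories contribute less gradient while still being seen.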

Canming Xia, Peixi Peng, Guang Tan, Zhan Su, Haoran Xu, Zhenxian Liu, Luntong Li • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Visual Reinforcement Learning | DMControl Cartpole, Swingup | Episode Return | 872 | 16
Visual Reinforcement Learning | DMControl Reacher Easy | Episode Return | 969 | 16
Visual Reinforcement Learning | DMControl Cheetah Run | Episode Return | 504 | 16
Visual Reinforcement Learning | DMControl Walker Walk | Episode Return | 802 | 16
Visual Reinforcement Learning | DMControl Finger, Spin | Episode Return | 976 | 16
Visual Reinforcement Learning | DMControl Ball in cup, Catch | Episode Return | 960 | 16
Autonomous Driving | CARLA (#HW) | Error Rate | 248 | 15
Visual Reinforcement Learning | CARLA (#GP scenario) | ER | 235 | 15
Visual Reinforcement Learning | CarRacing v0 (test) | Environment Reward | 6.19e+5 | 11
Visual Reinforcement Learning | CARLA Scenario A (test) | ER | 47 | 6

(Showing 10 of 14 rows)
