
Revisiting the Learning Objectives of Vision-Language Reward Models

About

Learning generalizable reward functions is a core challenge in embodied intelligence. Recent work leverages contrastive vision–language models (VLMs) to obtain dense, domain-agnostic rewards without human supervision. These methods adapt VLMs into reward models through increasingly complex learning objectives, yet meaningful comparison remains difficult due to differences in training data, architectures, and evaluation settings. In this work, we isolate the impact of the learning objective by evaluating recent VLM-based reward models under a unified framework with identical backbones, finetuning data, and evaluation environments. Using Meta-World tasks, we assess modeling accuracy by measuring consistency with the ground-truth reward and correlation with expert progress. Remarkably, we show that a simple triplet loss outperforms state-of-the-art methods, suggesting that much of the improvement in recent approaches may be attributable to differences in data and architectures.
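The paper's exact formulation is not reproduced on this page, so as a rough illustration only: a triplet loss of the kind mentioned in the abstract scores an anchor embedding against a positive and a negative, penalizing the model unless the positive is closer than the negative by at least a margin. In a hypothetical VLM reward-finetuning setup, the anchor might be a task-description embedding, the positive a frame near task completion, and the negative an early frame; all three names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on embedding vectors.

    Encourages the anchor (e.g. a goal-text embedding) to lie closer
    to the positive (e.g. a near-success frame embedding) than to the
    negative (e.g. an early-trajectory frame) by at least `margin`.
    Zero loss once the margin is satisfied.
    """
    d_pos = np.linalg.norm(anchor - positive)  # anchor-positive distance
    d_neg = np.linalg.norm(anchor - negative)  # anchor-negative distance
    return max(d_pos - d_neg + margin, 0.0)

# Illustrative usage with toy 2-D embeddings:
goal = np.array([1.0, 0.0])
late_frame = np.array([0.9, 0.1])   # close to the goal embedding
early_frame = np.array([-1.0, 0.0])  # far from the goal embedding
loss = triplet_margin_loss(goal, late_frame, early_frame)
```

In the toy case above the positive is already much closer than the negative, so the hinge is inactive and the loss is zero; swapping the positive and negative frames would yield a positive loss.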

Simon Roy, Samuel Barbeau, Giovanni Beltrame, Christian Desrosiers, Nicolas Thome • 2025

Related benchmarks

Task            | Dataset                 | Result                    | Rank
----------------|-------------------------|---------------------------|-----
Open Door       | Meta-World              | VOC Score 62.64           | 35
Button Press    | Meta-World              | VOC Score 95.47           | 28
Open Drawer     | Meta-World              | VOC Score 90.17           | 28
Reward Modeling | Meta-World Button Press | Prediction Accuracy 76.44 | 28
Reward Modeling | Meta-World Open Drawer  | Prediction Accuracy 69.01 | 28
Reward Modeling | Meta-World Open Door    | Prediction Accuracy 65.02 | 28
