Transport Discrepancy as a Reliability Signal for Vision-Language-Action Models

About

Vision-language-action (VLA) models that generate continuous action chunks via flow matching lack an internal signal for judging whether a given prediction is reliable. Distribution shift and long-horizon rollouts can push backbone representations away from the region the action head decodes reliably, yet the policy has no mechanism to detect or react to this drift. We observe that the cost of transporting observation features to the action representation in a shared feature space rises precisely when such drift occurs, providing a per-step reliability estimate without extra supervision. Building on this observation, we propose DiG (Discrepancy Gate), a lightweight plug-in module for flow-matching VLA policies. DiG computes a sliced Wasserstein transport cost between backbone features and the action expert's own input projection, maps it through an exponential gate, and uses the gate to modulate both a residual feature refinement and the training loss. At inference time, the gate enables DiG-Refine, an iterative refinement process that corrects action chunks before execution. Experiments in both simulation and real-world scenarios show that DiG consistently improves success rates, with the largest gains under distribution shift and on long-horizon tasks.
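As a rough illustration of the pipeline the abstract describes (transport cost, exponential gate, gated residual), here is a minimal NumPy sketch. It is not the authors' implementation. The function names, the Monte Carlo sliced 2-Wasserstein estimator, the gate form exp(-cost / temperature), and the choice to scale the residual by (1 - gate) are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def sliced_wasserstein(x, y, n_projections=64, seed=0):
    """Monte Carlo sliced 2-Wasserstein distance between two point sets.

    x, y: arrays of shape (n, d) with the same n (equal-size sets assumed).
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Random unit directions on the sphere in R^d.
    dirs = rng.normal(size=(n_projections, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # Project both sets onto each direction and sort: in 1D, optimal
    # transport between equal-size sets pairs sorted samples.
    px = np.sort(x @ dirs.T, axis=0)
    py = np.sort(y @ dirs.T, axis=0)
    return float(np.sqrt(np.mean((px - py) ** 2)))


def discrepancy_gate(cost, temperature=1.0):
    """Assumed gate form: 1 at zero cost, decaying toward 0 as cost grows."""
    return float(np.exp(-cost / temperature))


def gated_refine(features, residual, gate):
    """Assumed modulation: apply more of the residual when the gate is low."""
    return features + (1.0 - gate) * residual


# Toy usage: hypothetical backbone features vs. a shifted action-side projection.
rng = np.random.default_rng(1)
backbone = rng.normal(size=(32, 8))
action_proj = backbone + 0.5  # stand-in for drift between the two feature sets
cost = sliced_wasserstein(backbone, action_proj)
gate = discrepancy_gate(cost)
refined = gated_refine(backbone, action_proj - backbone, gate)
```

In this sketch a larger transport cost yields a smaller gate value, so the gate can be read as a per-step reliability score in (0, 1]. Whether the actual method uses the gate or its complement to weight the refinement and the loss is not stated in the abstract.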

Wanpeng Zhang, Ye Wang, Hao Luo, Haoqi Yuan, Yicheng Feng, Chaoyi Xu, Sipeng Zheng, Qin Jin, Zongqing Lu • 2025

Related benchmarks

Task	Dataset	Result	Rank
Robot Manipulation	LIBERO (test)	Average Success Rate: 98.3	237
Robotic Manipulation	RoboCasa 24-task 50-demonstration protocol	Avg SR (24 Tasks): 52.6	7
