$\pi_0$-EqM: Equilibrium Matching for Closed-Loop Vision-Language-Action Control

About

Vision-Language-Action (VLA) models have become the most widely adopted paradigm for robotic manipulation because of their potential for task generalization. However, the generative flow-matching action decoders used for VLA control are typically deployed with fixed sampling horizons, which limits state-dependent compute and temporal reuse across control cycles. We present $\pi_0$-EqM, which replaces the flow-matching expert in $\pi_0$ with an Equilibrium Matching (EqM) decoder while leaving the upstream VLA stack unchanged. Under a matched 300-step budget, $\pi_0$-EqM improves RoboTwin average success from 40.4% to 50.2% across 19 tasks and remains competitive on LIBERO, with its clearest gain on LIBERO-10 (87.0%). Two threshold scans reveal a task-dependent, non-monotonic relation between residual and success, which we term the stationarity–executability gap. The results suggest that inference depth in iterative VLA control is part of policy design, and they introduce an energy-based VLA perspective that may inform future work on composable action generation across tasks and embodiments.
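As an illustration of what state-dependent inference depth under a step budget can look like, the following is a minimal sketch, not the authors' implementation. It assumes the decoder exposes a gradient-like field over actions and that refinement stops once the residual norm falls below a threshold. The names `refine_action`, `grad_fn`, `step_size`, and `tol` are hypothetical, and the quadratic toy field only stands in for a learned decoder.

```python
import numpy as np


def refine_action(grad_fn, action, step_size=0.1, max_steps=300, tol=1e-3):
    """Descend a gradient-like field until the residual is small or the budget runs out.

    grad_fn   : hypothetical callable returning the field at the current action
    max_steps : step budget (300 mirrors the budget quoted in the abstract)
    tol       : residual threshold used as the stopping rule (assumed value)
    Returns (action, steps_used, final_residual).
    """
    a = np.asarray(action, dtype=float).copy()
    for step in range(max_steps):
        g = grad_fn(a)
        residual = float(np.linalg.norm(g))
        if residual < tol:
            # Stationary enough: stop early and spend fewer steps on this state.
            return a, step, residual
        a = a - step_size * g
    # Budget exhausted: report the residual at the final iterate.
    return a, max_steps, float(np.linalg.norm(grad_fn(a)))


# Toy stand-in for a learned field: gradient of 0.5 * ||a - target||^2.
target = np.array([0.5, -0.2, 0.1])


def toy_grad(a):
    return a - target


action, steps_used, residual = refine_action(toy_grad, np.zeros(3))
print(steps_used, residual)
```

In this sketch, passing the previous cycle's output as the initial `action` would be one way to reuse computation across control cycles. Because the abstract reports a non-monotonic relation between residual and success, a smaller `tol` should not be assumed to yield better task performance.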

Huanming Liu, Congsheng Xu, Jianmin Ji, Yao Mu • 2026

Related benchmarks

Task	Dataset	Result	Rank
Robot Manipulation	LIBERO	Spatial Success Rate: 97.2	223
