PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model
About
Vision-Language-Action models (VLAs) are emerging as powerful tools for learning generalizable visuomotor control policies. However, current VLAs are mostly trained on large-scale image-text-action data and remain limited in two key ways: (i) they struggle with pixel-level scene understanding, and (ii) they rely heavily on textual prompts, which reduces their flexibility in real-world settings. To address these challenges, we introduce PixelVLA, the first VLA model designed to support both pixel-level reasoning and multimodal prompting with text and visual inputs. Our approach is built on a new visuomotor instruction tuning framework that integrates a multiscale pixel-aware encoder with a visual prompt-aware encoder. To train PixelVLA effectively, we further propose a two-stage automated annotation pipeline that generates Pixel-160K, a large-scale dataset with pixel-level annotations derived from existing robot data. Experiments on three standard VLA benchmarks and two VLA model variants show that PixelVLA improves manipulation success rates by 10.1%-28.7% over OpenVLA, while requiring only 1.5% of its pretraining cost. These results demonstrate that PixelVLA can be integrated into existing VLAs to enable more accurate, efficient, and versatile robot control in complex environments.
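The abstract describes fusing a multiscale pixel-aware encoder and a visual prompt-aware encoder into a VLA backbone, but no interface is published here. The sketch below is a minimal, assumed illustration of that idea: every module name, dimension, and tensor shape (e.g. `PixelAwareVLAHead`, `vision_dim`, `token_dim`) is hypothetical and only shows how pixel-level and visual-prompt features could be projected and concatenated with text tokens before action decoding.

```python
# Illustrative sketch only; not the released PixelVLA implementation.
# All class/parameter names and shapes are assumptions for exposition.
import torch
import torch.nn as nn


class PixelAwareVLAHead(nn.Module):
    """Toy fusion module: projects multiscale pixel features and visual-prompt
    features into the token space of a VLA language backbone."""

    def __init__(self, vision_dim=1024, prompt_dim=256, token_dim=4096):
        super().__init__()
        # project multiscale pixel features (e.g. from an FPN-style encoder)
        self.pixel_proj = nn.Linear(vision_dim, token_dim)
        # project visual-prompt embeddings (e.g. pooled point/box/mask features)
        self.prompt_proj = nn.Linear(prompt_dim, token_dim)

    def forward(self, pixel_feats, prompt_feats, text_tokens):
        # pixel_feats:  (B, N_pix, vision_dim)
        # prompt_feats: (B, N_prompt, prompt_dim)
        # text_tokens:  (B, N_text, token_dim), already embedded by the LM
        pixel_tokens = self.pixel_proj(pixel_feats)
        prompt_tokens = self.prompt_proj(prompt_feats)
        # The fused sequence would then be passed to the VLA transformer,
        # which decodes action tokens; that part is omitted here.
        return torch.cat([pixel_tokens, prompt_tokens, text_tokens], dim=1)


# Minimal usage example with dummy tensors.
fusion = PixelAwareVLAHead()
tokens = fusion(
    torch.randn(1, 196, 1024),  # multiscale pixel features
    torch.randn(1, 4, 256),     # visual prompt (e.g. box/point) features
    torch.randn(1, 32, 4096),   # tokenized text instruction
)
print(tokens.shape)  # torch.Size([1, 232, 4096])
```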
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO (test) | Average Success Rate | 86.7 | 184 |
| Robotic Manipulation | LIBERO v1 (test) | Average Success Rate | 86.7 | 46 |
| Pick Coke Can | SimplerEnv Google Robot setup | VM Success Rate | 81.7 | 13 |
| Average (Overall Tasks) | SimplerEnv Google Robot setup | VM Success Rate | 63.3 | 13 |
| Move Near | SimplerEnv Google Robot setup | VM Success Rate | 67.7 | 13 |
| Open/Close Drawer | SimplerEnv Google Robot setup | VM Success Rate | 42.3 | 13 |
| Robot Manipulation | SimplerEnv WidowX | Grasp Rate (Put Spoon) | 51.7 | 11 |