FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies
About
Developing efficient Vision-Language-Action (VLA) policies is crucial for practical robotics deployment, yet current approaches face prohibitive computational costs and resource requirements. Existing diffusion-based VLA policies require multi-billion-parameter models and massive datasets to achieve strong performance. We tackle this efficiency challenge with two contributions: intermediate-modality fusion, which reallocates capacity to the diffusion head by pruning up to $50\%$ of LLM layers, and action-specific Global-AdaLN conditioning, which cuts parameters by $20\%$ through modular adaptation. We integrate these advances into a novel 950M-parameter VLA called FLOWER. Pretrained in just 200 H100 GPU hours, FLOWER delivers performance competitive with larger VLAs across $190$ tasks spanning ten simulation and real-world benchmarks, and demonstrates robustness across diverse robotic embodiments. In addition, FLOWER achieves a new SoTA of 4.53 on the CALVIN ABC benchmark. Demos, code, and pretrained weights are available at https://intuitive-robots.github.io/flower_vla/.
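The abstract names Global-AdaLN conditioning without describing its mechanics. As a rough illustration only, the sketch below shows generic adaptive layer normalization in which one shift/scale projection of the conditioning vector is reused by every layer rather than learned per layer. Reading "global" as this kind of sharing is an assumption, not a description of FLOWER's implementation, and every name and dimension here (`adaln`, `modulation_params`, `dim`, `cond_dim`, `num_layers`) is hypothetical. The parameter count is for the toy setup and is not meant to reproduce the $20\%$ figure.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the feature axis, without learned affine parameters.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adaln(x, shift, scale):
    # Adaptive LayerNorm: the conditioning signal supplies shift and scale.
    return layer_norm(x) * (1.0 + scale) + shift

def modulation_params(dim, cond_dim, num_layers, shared):
    # Parameters of the linear map cond -> (shift, scale): weights plus bias.
    per_projection = cond_dim * 2 * dim + 2 * dim
    return per_projection if shared else per_projection * num_layers

# Hypothetical toy sizes, chosen only for illustration.
dim, cond_dim, num_layers = 8, 4, 6
rng = np.random.default_rng(0)

cond = rng.normal(size=cond_dim)                  # conditioning vector
W = rng.normal(size=(cond_dim, 2 * dim)) * 0.1    # single shared projection
b = np.zeros(2 * dim)

# Compute shift and scale once, then reuse them in every layer (assumed sharing).
shift, scale = np.split(cond @ W + b, 2)

x = rng.normal(size=(3, dim))                     # three tokens of width dim
for _ in range(num_layers):
    x = adaln(x, shift, scale)

shared_count = modulation_params(dim, cond_dim, num_layers, shared=True)
per_layer_count = modulation_params(dim, cond_dim, num_layers, shared=False)
print(shared_count, per_layer_count)
```

In this toy setup, sharing the projection divides the modulation parameter count by the number of layers; how much that saves overall depends on how large the modulation projections are relative to the rest of the model.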
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO | Goal Achievement | 96.9 | 700 |
| Robotic Manipulation | LIBERO | Spatial Success Rate | 97.1 | 314 |
| Robot Manipulation | LIBERO (test) | Average Success Rate | 95.7 | 184 |
| Robotic Manipulation | Calvin ABCD→D | Avg Length | 4.44 | 89 |
| Robot Policy Learning | LIBERO | S (Spatial) Rate | 97.5 | 65 |
| Robot Manipulation | Calvin ABC→D | Average Successful Length | 4.53 | 48 |
| Robotic Manipulation | LIBERO (test) | Object Success Rate | 99.1 | 45 |
| Robot Manipulation | SimplerEnv WidowX Robot tasks | Average Success Rate | 45 | 32 |
| Robot Manipulation | Simpler-Bridge v1 (test) | Success Rate (Spoon) | 71 | 21 |
| Robotic Manipulation | WidowX | Spoon Success Rate | 71 | 17 |