OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation
About
Robust robotic manipulation requires not only predicting how the scene evolves over time, but also recognizing task-relevant objects in complex scenes. However, existing vision-language-action (VLA) models face two limitations: they typically act only on the current frame, and future prediction and object-aware reasoning are often learned in separate latent spaces. We propose OFlow (injecting Object-Aware Temporal Flow Matching into VLAs), a framework that addresses both limitations by unifying temporal foresight and object-aware reasoning in a shared semantic latent space. Our method forecasts future latents with temporal flow matching, factorizes them into object-aware representations that emphasize physically relevant cues while filtering out task-irrelevant variation, and conditions continuous action generation on these predictions. Integrated into VLA pipelines, OFlow enables more reliable control under distribution shifts. Extensive experiments on the LIBERO, LIBERO-Plus, MetaWorld, and SimplerEnv benchmarks, as well as on real-world tasks, demonstrate that object-aware foresight consistently improves robustness and success rates.
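To make the phrase "forecasts future latents with temporal flow matching" concrete, the following is a minimal sketch of generic flow matching with a straight-line probability path, written in NumPy. It is not the authors' implementation: the abstract does not specify the path, the source distribution, the network, or the latent dimensions, so the function names, the choice of starting from the current-frame latent, the array shapes, and the oracle velocity field are all assumptions made for illustration.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight-line path from x0 (t=0) to x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Velocity of the straight-line path; constant in t."""
    return x1 - x0

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t)
    return float(np.mean((pred - target_velocity(x0, x1)) ** 2))

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

# Hypothetical latents: 4 samples of dimension 8 (shapes are assumptions).
rng = np.random.default_rng(0)
current_latent = rng.normal(size=(4, 8))  # assumed source of the flow
future_latent = rng.normal(size=(4, 8))   # assumed target of the flow

# Oracle velocity field for this single pair. A real model would replace it
# with a learned network; it is used here only to check the mechanics.
def oracle(x, t):
    return future_latent - current_latent

loss = flow_matching_loss(oracle, current_latent, future_latent, t=0.3)
predicted = euler_sample(oracle, current_latent, num_steps=10)
```

In a trained system, `oracle` would be a neural network fitted by minimizing `flow_matching_loss` over many latent pairs and random times `t`, and `euler_sample` would produce the forecast latent that downstream modules consume.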
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robotic Manipulation | LIBERO | Spatial Success Rate | 98 | 527 |
| Robotic Manipulation | LIBERO-Plus | -- | -- | 249 |
| Robot Manipulation | SimplerEnv WidowX | Success Rate: Put Spoon on Towel | 76.8 | 98 |
| Robotic Manipulation | MetaWorld MT50 | Success Rate (Easy, 28 Tasks) | 93.6 | 5 |