OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation

About

Robust robotic manipulation requires not only predicting how the scene evolves over time, but also recognizing task-relevant objects in complex scenes. However, existing VLA models face two limitations. They typically act only on the current frame, while future prediction and object-aware reasoning are often learned in separate latent spaces. We propose OFlow (injecting Object-Aware Temporal Flow Matching into VLAs), a framework that addresses both limitations by unifying temporal foresight and object-aware reasoning in a shared semantic latent space. Our method forecasts future latents with temporal flow matching, factorizes them into object-aware representations that emphasize physically relevant cues while filtering task-irrelevant variation, and conditions continuous action generation on these predictions. By integrating OFlow into VLA pipelines, our method enables more reliable control under distribution shifts. Extensive experiments across LIBERO, LIBERO-Plus, MetaWorld, and SimplerEnv benchmarks and real-world tasks demonstrate that object-aware foresight consistently enhances robustness and success.

Kuanning Wang, Ke Fan, Chenhao Qiu, Zeyu Shangguan, Yuqian Fu, Yanwei Fu, Daniel Seita, Xiangyang Xue• 2026

Related benchmarks

Task	Dataset	Result
Robotic Manipulation	LIBERO	Spatial Success Rate98	570
Robotic Manipulation	LIBERO-Plus	--	414
Robot Manipulation	SimplerEnv WidowX	Overall Success Rate67.1	123
Robotic Manipulation	MetaWorld MT50	Success Rate (Easy, 28 Tasks)93.6	5

Showing 4 of 4 rows

Other info

Follow for update

@wizwand_team Discord