RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies
About
Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings. This limits systematic understanding, comparison, and measurement of progress. To address these challenges, we introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found at our website https://robomme.github.io.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Permanence | RoboMME Permanence | Video Umsk Success Rate | 98.78 | 23 |
| Counting | RoboMME Counting | Bin Fill Success Rate | 85.78 | 23 |
| Imitation | RoboMME Imitation | Move Cube Success Rate | 87.78 | 23 |
| Reference | RoboMME Reference | Pick HighL Success Rate | 83.33 | 23 |
| Robotic Memory Manipulation | RoboMME Overall | Average Success Rate | 84.08 | 23 |
| Robotic Generalist Policy Execution | MME-VLA 1.0 (test) | Counting Score | 83.86 | 21 |
| Robotic Manipulation | RoboMME Real-world | Put Fruits Success Rate | 9 | 3 |
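The success-rate metrics above can be thought of as per-task episode statistics rolled up into an overall average. A minimal sketch of that aggregation, assuming a boolean success/failure outcome per episode (the task names and episode outcomes below are illustrative, not actual benchmark data, and the function names are hypothetical rather than part of the RoboMME codebase):

```python
# Hypothetical sketch of leaderboard-style success-rate aggregation.
# Episode outcomes here are made up for illustration only.
from statistics import mean


def success_rate(episode_outcomes):
    """Fraction of successful episodes, expressed as a percentage."""
    return 100.0 * mean(1.0 if ok else 0.0 for ok in episode_outcomes)


# Illustrative per-task episode outcomes (True = success).
results = {
    "Permanence": [True, True, False, True],
    "Counting":   [True, False, True, True],
}

# Per-task success rates, then an unweighted mean across tasks.
per_task = {task: success_rate(eps) for task, eps in results.items()}
overall = mean(per_task.values())
```

How the real benchmark weights tasks into "RoboMME Overall" may differ (e.g. weighting by episode count per task); this sketch uses a plain unweighted mean across tasks.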