TiPToP: A Modular Open-Vocabulary Planning System for Robotic Manipulation

About

We present TiPToP, an extensible modular system that combines pretrained vision foundation models with an existing Task and Motion Planning (TAMP) system to solve multi-step manipulation tasks directly from RGB images and natural-language instructions. Our system aims to be simple and easy to use: it can be installed and run on a standard DROID setup in under one hour and adapted to new embodiments with minimal effort. We evaluate TiPToP -- which requires zero robot data -- on 28 tabletop manipulation tasks in simulation and the real world and find that it matches or outperforms $\pi_{0.5}\text{-DROID}$, a vision-language-action (VLA) model fine-tuned on 350 hours of embodiment-specific demonstrations. TiPToP's modular architecture lets us analyze the system's failure modes at the component level: from an evaluation of 173 trials, we identify concrete directions for improvement. We release TiPToP open-source to further research on modular manipulation systems and tighter integration between learning and planning. Project website and code: https://tiptop-robot.github.io

William Shen, Nishanth Kumar, Sahit Chintalapudi, Jie Wang, Christopher Watson, Edward Hu, Jing Cao, Dinesh Jayaraman, Leslie Pack Kaelbling, Tomás Lozano-Pérez • 2026
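
To make the modular architecture concrete, the sketch below illustrates the kind of perception-planning-execution decomposition the abstract describes: a pretrained open-vocabulary vision model grounds the instruction into a scene, a TAMP-style planner turns the goal into a skill sequence, and an executor runs it. This is a minimal illustrative sketch under assumed interfaces, not TiPToP's actual API; all class, function, and object names here are hypothetical placeholders, and the perception and planning bodies are stubbed out.

```python
# Illustrative sketch only (NOT the TiPToP API): a modular open-vocabulary
# manipulation pipeline with perception, planning, and execution as separate,
# swappable components. All names are hypothetical placeholders.
from dataclasses import dataclass, field


@dataclass
class Detection:
    """An open-vocabulary detection: a language label grounded to a 3D pose."""
    label: str
    pose_xyz: tuple  # (x, y, z) in the robot frame


@dataclass
class Scene:
    """Symbolic + geometric state handed from perception to the planner."""
    objects: dict = field(default_factory=dict)  # label -> Detection


def perceive(rgb_image, instruction: str) -> Scene:
    """Stubbed perception: a real system would run pretrained vision
    foundation models (e.g. an open-vocabulary detector) on the RGB image.
    Here we return a canned scene purely for illustration."""
    scene = Scene()
    for label, pose in [("red block", (0.4, 0.1, 0.02)), ("bowl", (0.5, -0.2, 0.05))]:
        scene.objects[label] = Detection(label, pose)
    return scene


def plan(scene: Scene, instruction: str) -> list:
    """Stubbed task-level planner: maps a pick-and-place instruction onto a
    fixed skill sequence. A real TAMP system would also search over grasps
    and collision-free motions."""
    # Naive grounding: assume the instruction names exactly one object and one target.
    names = [n for n in scene.objects if n in instruction]
    obj, target = names[0], names[1]
    return [("pick", obj), ("place", obj, target)]


def execute(plan_steps: list, scene: Scene) -> None:
    """Stubbed executor: prints each skill; a real system would send
    motion-planned trajectories to the robot controller."""
    for step in plan_steps:
        print("executing:", step)


if __name__ == "__main__":
    instruction = "put the red block in the bowl"
    scene = perceive(rgb_image=None, instruction=instruction)
    steps = plan(scene, instruction)
    execute(steps, scene)
```

In a real system the perception module would call the pretrained vision models and the planning module would invoke the TAMP planner; the point of the sketch is only the module boundaries, which are what make the component-level failure analysis described in the abstract possible.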

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Multi-step manipulation | DROID Tabletop Multi-step tasks | Success Rate | 98 | 18 |
| Rearrangement with distractors | DROID Tabletop Distractor tasks | Success Rate | 27 | 18 |
| Semantic reasoning manipulation | DROID Tabletop Semantic tasks | Success Rate | 26 | 18 |
| Pick-&-Place | DROID Tabletop Simple tasks | Success Rate | 22 | 12 |
| Pick-&-Place | Simulation | Execution Time (s) | 17.9 | 4 |
