TiPToP: A Modular Open-Vocabulary Planning System for Robotic Manipulation

About

We present TiPToP, an extensible modular system that combines pretrained vision foundation models with an existing Task and Motion Planning (TAMP) system to solve multi-step manipulation tasks directly from RGB images and natural-language instructions. Our system aims to be simple and easy to use: it can be installed and run on a standard DROID setup in under one hour and adapted to new embodiments with minimal effort. We evaluate TiPToP -- which requires zero robot data -- on 28 tabletop manipulation tasks in simulation and the real world and find that it matches or outperforms $\pi_{0.5}\text{-DROID}$, a vision-language-action (VLA) model fine-tuned on 350 hours of embodiment-specific demonstrations. TiPToP's modular architecture lets us analyze the system's failure modes at the component level: from an evaluation of 173 trials, we identify concrete directions for improvement. We release TiPToP open-source to further research on modular manipulation systems and tighter integration between learning and planning. Project website and code: https://tiptop-robot.github.io

William Shen, Nishanth Kumar, Sahit Chintalapudi, Jie Wang, Christopher Watson, Edward Hu, Jing Cao, Dinesh Jayaraman, Leslie Pack Kaelbling, Tomás Lozano-Pérez • 2026
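
To make the modular architecture concrete, the sketch below illustrates the kind of perception-planning-execution decomposition the abstract describes: a pretrained open-vocabulary vision model grounds the instruction into a scene, a TAMP-style planner turns the goal into a skill sequence, and an executor runs it. This is a minimal illustrative sketch under assumed interfaces, not TiPToP's actual API; all class, function, and object names here are hypothetical placeholders, and the perception and planning bodies are stubbed out.

```python
# Illustrative sketch only (NOT the TiPToP API): a modular open-vocabulary
# manipulation pipeline with perception, planning, and execution as separate,
# swappable components. All names are hypothetical placeholders.
from dataclasses import dataclass, field


@dataclass
class Detection:
    """An open-vocabulary detection: a language label grounded to a 3D pose."""
    label: str
    pose_xyz: tuple  # (x, y, z) in the robot frame


@dataclass
class Scene:
    """Symbolic + geometric state handed from perception to the planner."""
    objects: dict = field(default_factory=dict)  # label -> Detection


def perceive(rgb_image, instruction: str) -> Scene:
    """Stubbed perception: a real system would run pretrained vision
    foundation models (e.g. an open-vocabulary detector) on the RGB image.
    Here we return a canned scene purely for illustration."""
    scene = Scene()
    for label, pose in [("red block", (0.4, 0.1, 0.02)), ("bowl", (0.5, -0.2, 0.05))]:
        scene.objects[label] = Detection(label, pose)
    return scene


def plan(scene: Scene, instruction: str) -> list:
    """Stubbed task-level planner: maps a pick-and-place instruction onto a
    fixed skill sequence. A real TAMP system would also search over grasps
    and collision-free motions."""
    # Naive grounding: assume the instruction names exactly one object and one target.
    names = [n for n in scene.objects if n in instruction]
    obj, target = names[0], names[1]
    return [("pick", obj), ("place", obj, target)]


def execute(plan_steps: list, scene: Scene) -> None:
    """Stubbed executor: prints each skill; a real system would send
    motion-planned trajectories to the robot controller."""
    for step in plan_steps:
        print("executing:", step)


if __name__ == "__main__":
    instruction = "put the red block in the bowl"
    scene = perceive(rgb_image=None, instruction=instruction)
    steps = plan(scene, instruction)
    execute(steps, scene)
```

In a real system the perception module would call the pretrained vision models and the planning module would invoke the TAMP planner; the point of the sketch is only the module boundaries, which are what make the component-level failure analysis described in the abstract possible.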

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Multi-step manipulation | DROID Tabletop Multi-step tasks | Success Rate | 98 | 18 |
| Rearrangement with distractors | DROID Tabletop Distractor tasks | Success Rate | 27 | 18 |
| Semantic reasoning manipulation | DROID Tabletop Semantic tasks | Success Rate | 26 | 18 |
| Pick-&-Place | DROID Tabletop Simple tasks | Success Rate | 22 | 12 |
| Pick-&-Place | Simulation | Execution Time (s) | 17.9 | 4 |
