RoboInter: A Holistic Intermediate Representation Suite Towards Robotic Manipulation

About

Advances in large vision-language models (VLMs) have stimulated growing interest in vision-language-action (VLA) systems for robot manipulation. However, existing manipulation datasets remain costly to curate, highly embodiment-specific, and insufficient in coverage and diversity, thereby hindering the generalization of VLA models. Recent approaches attempt to mitigate these limitations via a plan-then-execute paradigm, where high-level plans (e.g., subtasks, trace) are first generated and subsequently translated into low-level actions, but they critically rely on extra intermediate supervision, which is largely absent from existing datasets. To bridge this gap, we introduce the RoboInter Manipulation Suite, a unified resource including data, benchmarks, and models of intermediate representations for manipulation. It comprises RoboInter-Tool, a lightweight GUI that enables semi-automatic annotation of diverse representations, and RoboInter-Data, a large-scale dataset containing over 230k episodes across 571 diverse scenes, which provides dense per-frame annotations over more than 10 categories of intermediate representations, substantially exceeding prior work in scale and annotation quality. Building upon this foundation, RoboInter-VQA introduces 9 spatial and 20 temporal embodied VQA categories to systematically benchmark and enhance the embodied reasoning capabilities of VLMs. Meanwhile, RoboInter-VLA offers an integrated plan-then-execute framework, supporting modular and end-to-end VLA variants that bridge high-level planning with low-level execution via intermediate supervision. In total, RoboInter establishes a practical foundation for advancing robust and generalizable robotic learning via fine-grained and diverse intermediate representations.

Hao Li, Ziqin Wang, Zi-han Ding, Shuai Yang, Yilun Chen, Yang Tian, Xiaolin Hu, Tai Wang, Dahua Lin, Feng Zhao, Si Liu, Jiangmiao Pang • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Object Hallucination Evaluation | POPE | Accuracy | 91.4 | 1455
Visual Question Answering | TextVQA | Accuracy | 83 | 1285
Multimodal Evaluation | MME | Score | 2.28e+3 | 658
Optical Character Recognition | OCRBench | Score | 832 | 232
Visual Grounding | RefCOCO+ (val) | Accuracy | 86.6 | 212
Visual Grounding | RefCOCO (val) | Accuracy | 91.5 | 147
Visual Grounding | RefCOCOg (val) | Accuracy | 88.4 | 114
Visual Question Answering | COCO | Score | 15.9 | 106
Embodied Spatial Point Reasoning | Where2Place | Accuracy | 66.3 | 19
Close-loop Robotics Manipulation | SimplerEnv Google Robot | Drawer Open/Close Success (VM) | 73.6 | 13
Showing 10 of 26 rows
