RoboInter: A Holistic Intermediate Representation Suite Towards Robotic Manipulation

About

Advances in large vision-language models (VLMs) have stimulated growing interest in vision-language-action (VLA) systems for robot manipulation. However, existing manipulation datasets remain costly to curate, highly embodiment-specific, and insufficient in coverage and diversity, thereby hindering the generalization of VLA models. Recent approaches attempt to mitigate these limitations via a plan-then-execute paradigm, where high-level plans (e.g., subtasks, trace) are first generated and subsequently translated into low-level actions, but they critically rely on extra intermediate supervision, which is largely absent from existing datasets. To bridge this gap, we introduce the RoboInter Manipulation Suite, a unified resource including data, benchmarks, and models of intermediate representations for manipulation. It comprises RoboInter-Tool, a lightweight GUI that enables semi-automatic annotation of diverse representations, and RoboInter-Data, a large-scale dataset containing over 230k episodes across 571 diverse scenes, which provides dense per-frame annotations over more than 10 categories of intermediate representations, substantially exceeding prior work in scale and annotation quality. Building upon this foundation, RoboInter-VQA introduces 9 spatial and 20 temporal embodied VQA categories to systematically benchmark and enhance the embodied reasoning capabilities of VLMs. Meanwhile, RoboInter-VLA offers an integrated plan-then-execute framework, supporting modular and end-to-end VLA variants that bridge high-level planning with low-level execution via intermediate supervision. In total, RoboInter establishes a practical foundation for advancing robust and generalizable robotic learning via fine-grained and diverse intermediate representations.

Hao Li, Ziqin Wang, Zi-han Ding, Shuai Yang, Yilun Chen, Yang Tian, Xiaolin Hu, Tai Wang, Dahua Lin, Feng Zhao, Si Liu, Jiangmiao Pang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 91.4 | 2019 |
| Visual Question Answering | TextVQA | Accuracy | 83 | 1453 |
| Multimodal Evaluation | MME | Score | 2.28e+3 | 727 |
| Optical Character Recognition | OCRBench | Score | 832 | 433 |
| Visual Grounding | RefCOCO+ (val) | Accuracy | 86.6 | 253 |
| Visual Grounding | RefCOCO (val) | Accuracy | 91.5 | 172 |
| Visual Grounding | RefCOCOg (val) | Accuracy | 88.4 | 158 |
| Visual Question Answering | COCO | Score | 15.9 | 106 |
| Temporal Task Planning | RoboVQA | Score | 74.5 | 20 |
| Embodied Spatial Point Reasoning | Where2Place | Accuracy | 66.3 | 19 |
Showing 10 of 26 rows
