
RoboInter: A Holistic Intermediate Representation Suite Towards Robotic Manipulation

About

Advances in large vision-language models (VLMs) have stimulated growing interest in vision-language-action (VLA) systems for robot manipulation. However, existing manipulation datasets remain costly to curate, highly embodiment-specific, and insufficient in coverage and diversity, thereby hindering the generalization of VLA models. Recent approaches attempt to mitigate these limitations via a plan-then-execute paradigm, where high-level plans (e.g., subtasks, traces) are first generated and subsequently translated into low-level actions, but they critically rely on extra intermediate supervision, which is largely absent from existing datasets. To bridge this gap, we introduce the RoboInter Manipulation Suite, a unified resource comprising data, benchmarks, and models built on intermediate representations for manipulation. It includes RoboInter-Tool, a lightweight GUI that enables semi-automatic annotation of diverse representations, and RoboInter-Data, a large-scale dataset containing over 230k episodes across 571 diverse scenes, which provides dense per-frame annotations over more than 10 categories of intermediate representations, substantially exceeding prior work in scale and annotation quality. Building upon this foundation, RoboInter-VQA introduces 9 spatial and 20 temporal embodied VQA categories to systematically benchmark and enhance the embodied reasoning capabilities of VLMs. Meanwhile, RoboInter-VLA offers an integrated plan-then-execute framework, supporting modular and end-to-end VLA variants that bridge high-level planning with low-level execution via intermediate supervision. In total, RoboInter establishes a practical foundation for advancing robust and generalizable robotic learning via fine-grained and diverse intermediate representations.
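To make the plan-then-execute paradigm concrete, here is a minimal, self-contained sketch of the two-stage structure the abstract describes: a high-level planner emits an intermediate representation (a subtask label and a 2-D end-effector trace), and a low-level executor converts that trace into delta-position actions. All names (`IntermediatePlan`, `plan`, `execute`) and the fixed straight-line trace are illustrative assumptions, not the RoboInter-VLA API.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class IntermediatePlan:
    """Hypothetical intermediate representation: a subtask label
    plus a 2-D end-effector trace (sequence of waypoints)."""
    subtask: str
    trace: List[Tuple[float, float]]

def plan(instruction: str) -> IntermediatePlan:
    # Stand-in for a high-level VLM planner: here it just returns a
    # fixed subtask label and a straight-line 5-point trace.
    trace = [(0.1 * i, 0.1 * i) for i in range(5)]
    return IntermediatePlan(subtask=f"reach: {instruction}", trace=trace)

def execute(p: IntermediatePlan) -> List[Tuple[float, float]]:
    # Stand-in for a low-level policy: turns consecutive trace
    # waypoints into delta-position actions for the controller.
    actions = []
    for (x0, y0), (x1, y1) in zip(p.trace, p.trace[1:]):
        actions.append((x1 - x0, y1 - y0))
    return actions

p = plan("pick up the cup")
acts = execute(p)
print(len(acts))  # a 5-point trace yields 4 delta actions
```

The point of the intermediate layer is that the planner and executor can be trained or swapped independently, provided the trace-level supervision exists in the dataset.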

Hao Li, Ziqin Wang, Zi-han Ding, Shuai Yang, Yilun Chen, Yang Tian, Xiaolin Hu, Tai Wang, Dahua Lin, Feng Zhao, Si Liu, Jiangmiao Pang • 2026

Related benchmarks

Task                             | Dataset                  | Result                              | Rank
Visual Question Answering        | TextVQA                  | Accuracy 83.0                       | 1117
Object Hallucination Evaluation  | POPE                     | Accuracy 91.4                       | 935
Multimodal Evaluation            | MME                      | Score 2280                          | 557
Visual Grounding                 | RefCOCO+ (val)           | Accuracy 86.6                       | 171
Visual Grounding                 | RefCOCO (val)            | Accuracy 91.5                       | 119
Visual Grounding                 | RefCOCOg (val)           | Accuracy 88.4                       | 93
Optical Character Recognition    | OCRBench                 | --                                  | 83
Visual Question Answering        | COCO                     | Score 15.9                          | 21
Close-loop Robotics Manipulation | SimplerEnv Google Robot  | Drawer Open/Close Success (VM) 73.6 | 13
Temporal Reasoning               | RoboInter-VQA Temporal   | Visual Trace 81.9                   | 13

Showing 10 of 26 rows
