
RoboInter: A Holistic Intermediate Representation Suite Towards Robotic Manipulation

About

Advances in large vision-language models (VLMs) have stimulated growing interest in vision-language-action (VLA) systems for robot manipulation. However, existing manipulation datasets remain costly to curate, highly embodiment-specific, and insufficient in coverage and diversity, thereby hindering the generalization of VLA models. Recent approaches attempt to mitigate these limitations via a plan-then-execute paradigm, where high-level plans (e.g., subtasks, traces) are first generated and subsequently translated into low-level actions, but they critically rely on extra intermediate supervision, which is largely absent from existing datasets. To bridge this gap, we introduce the RoboInter Manipulation Suite, a unified resource comprising data, benchmarks, and models built on intermediate representations for manipulation. It includes RoboInter-Tool, a lightweight GUI that enables semi-automatic annotation of diverse representations, and RoboInter-Data, a large-scale dataset containing over 230k episodes across 571 diverse scenes, which provides dense per-frame annotations over more than 10 categories of intermediate representations, substantially exceeding prior work in scale and annotation quality. Building upon this foundation, RoboInter-VQA introduces 9 spatial and 20 temporal embodied VQA categories to systematically benchmark and enhance the embodied reasoning capabilities of VLMs. Meanwhile, RoboInter-VLA offers an integrated plan-then-execute framework, supporting modular and end-to-end VLA variants that bridge high-level planning with low-level execution via intermediate supervision. In total, RoboInter establishes a practical foundation for advancing robust and generalizable robotic learning via fine-grained and diverse intermediate representations.
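To make the plan-then-execute paradigm concrete, here is a minimal, self-contained sketch of the two-stage structure the abstract describes: a high-level planner emits an intermediate representation (a subtask label and a 2-D end-effector trace), and a low-level executor converts that trace into delta-position actions. All names (`IntermediatePlan`, `plan`, `execute`) and the fixed straight-line trace are illustrative assumptions, not the RoboInter-VLA API.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class IntermediatePlan:
    """Hypothetical intermediate representation: a subtask label
    plus a 2-D end-effector trace (sequence of waypoints)."""
    subtask: str
    trace: List[Tuple[float, float]]

def plan(instruction: str) -> IntermediatePlan:
    # Stand-in for a high-level VLM planner: here it just returns a
    # fixed subtask label and a straight-line 5-point trace.
    trace = [(0.1 * i, 0.1 * i) for i in range(5)]
    return IntermediatePlan(subtask=f"reach: {instruction}", trace=trace)

def execute(p: IntermediatePlan) -> List[Tuple[float, float]]:
    # Stand-in for a low-level policy: turns consecutive trace
    # waypoints into delta-position actions for the controller.
    actions = []
    for (x0, y0), (x1, y1) in zip(p.trace, p.trace[1:]):
        actions.append((x1 - x0, y1 - y0))
    return actions

p = plan("pick up the cup")
acts = execute(p)
print(len(acts))  # a 5-point trace yields 4 delta actions
```

The point of the intermediate layer is that the planner and executor can be trained or swapped independently, provided the trace-level supervision exists in the dataset.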

Hao Li, Ziqin Wang, Zi-han Ding, Shuai Yang, Yilun Chen, Yang Tian, Xiaolin Hu, Tai Wang, Dahua Lin, Feng Zhao, Si Liu, Jiangmiao Pang • 2026

Related benchmarks

Task                             | Dataset                  | Result                              | Rank
Visual Question Answering        | TextVQA                  | Accuracy 83.0                       | 1117
Object Hallucination Evaluation  | POPE                     | Accuracy 91.4                       | 935
Multimodal Evaluation            | MME                      | Score 2280                          | 557
Visual Grounding                 | RefCOCO+ (val)           | Accuracy 86.6                       | 171
Visual Grounding                 | RefCOCO (val)            | Accuracy 91.5                       | 119
Visual Grounding                 | RefCOCOg (val)           | Accuracy 88.4                       | 93
Optical Character Recognition    | OCRBench                 | --                                  | 83
Visual Question Answering        | COCO                     | Score 15.9                          | 21
Close-loop Robotics Manipulation | SimplerEnv Google Robot  | Drawer Open/Close Success (VM) 73.6 | 13
Temporal Reasoning               | RoboInter-VQA Temporal   | Visual Trace 81.9                   | 13

Showing 10 of 26 rows
