OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning
About
General-purpose robots capable of performing diverse tasks require synergistic reasoning and acting capabilities. However, recent dual-system approaches, which separate high-level reasoning from low-level acting, often suffer from limited mutual understanding between the two systems and from latency issues. This paper introduces OneTwoVLA, a single unified vision-language-action model that can both act (System One) and reason (System Two). Crucially, OneTwoVLA adaptively switches between two modes: it reasons explicitly at critical moments during task execution, and generates actions based on the most recent reasoning at all other times. To further unlock OneTwoVLA's reasoning and generalization capabilities, we design a scalable pipeline for synthesizing embodied reasoning-centric vision-language data, which is used for co-training with robot data. We validate OneTwoVLA's effectiveness through extensive experiments, highlighting its superior performance across four key capabilities: long-horizon task planning, error detection and recovery, natural human-robot interaction, and generalizable visual grounding. Together, these capabilities enable the model to perform long-horizon, highly dexterous manipulation tasks such as making hotpot or mixing cocktails.
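The adaptive mode switching described above can be sketched as a simple control loop: at each step, the unified model either produces explicit reasoning (System Two) or emits an action conditioned on its most recent reasoning (System One). The sketch below is a hypothetical illustration only; the class, method names, and the trigger heuristic are invented for clarity and are not the authors' actual architecture or API.

```python
from dataclasses import dataclass


@dataclass
class UnifiedVLA:
    """Toy stand-in for a unified vision-language-action model."""
    last_reasoning: str = "initial plan"  # most recent reasoning text
    step: int = 0

    def should_reason(self, observation: str) -> bool:
        # In the paper, the model itself decides when to reason; here we
        # fake "critical moments" with a crude heuristic (errors, or
        # periodic subtask boundaries).
        return "error" in observation or self.step % 3 == 0

    def reason(self, observation: str) -> str:
        # System Two: update the plan in natural language.
        self.last_reasoning = f"replan given '{observation}'"
        return self.last_reasoning

    def act(self, observation: str) -> str:
        # System One: emit an action conditioned on the latest reasoning.
        return f"action conditioned on '{self.last_reasoning}'"


def control_loop(model: UnifiedVLA, observations: list[str]) -> list[str]:
    """Run one episode: reason only at critical moments, act otherwise."""
    actions = []
    for obs in observations:
        if model.should_reason(obs):
            model.reason(obs)           # explicit reasoning (System Two)
        actions.append(model.act(obs))  # act on most recent reasoning
        model.step += 1
    return actions
```

The key property this loop illustrates is that acting never waits on a separate high-level planner: reasoning is interleaved in the same model and refreshed only when needed, which is how the unified design avoids the latency of dual-system pipelines.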
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Error Detection and Recovery | Hotpot Robot Data (test) | Recovery Success Ratio | 5 | 3 |
| Error Detection and Recovery | Robot Tasks Combined Total (test) | Successful Recoveries Count | 8 | 3 |
| Visual Grounding | Single-Env (test) | Success Rate | 88 | 3 |
| Visual Grounding | Open-World (test) | Success Rate | 73 | 3 |
| Error Detection and Recovery | Tomato-Egg Robot Data (test) | Recovery Success Rate | 3 | 3 |
| Human-Robot Interaction | HotPot | Successes | 10 | 2 |
| Human-Robot Interaction | Cocktail | Successes | 10 | 2 |
| Human-Robot Interaction | Hotpot and Cocktail Aggregate | Successes | 20 | 2 |