D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
About
Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive costs of physical trajectory collection. Desktop environments -- particularly gaming -- offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework demonstrating that desktop interactions can serve as an effective pretraining substrate for embodied AI tasks. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. Our framework comprises three components: (1) the OWA Toolkit, which unifies diverse desktop interactions into a standardized format with 152x compression; (2) the Generalist-IDM, which achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling; and (3) VAPT, which transfers desktop-pretrained representations to physical manipulation and navigation. Using 1.3K+ hours of data (259 hours of human demonstrations and 1K+ hours of pseudo-labeled gameplay), our 1B-parameter model achieves 96.6% success on LIBERO manipulation and 83.3% on CANVAS navigation, matching or surpassing models up to 7x larger, such as π₀ (3.3B) and OpenVLA (7B). These results demonstrate that sensorimotor primitives learned from digital interactions transfer effectively to real-world physical tasks, establishing desktop pretraining as a practical paradigm for embodied AI. All resources are publicly available at https://worv-ai.github.io/d2e.
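The pipeline hinges on representing desktop interactions as timestamped events that an inverse dynamics model can predict from video. The sketch below illustrates the general idea of such a unified event record and its serialization into flat tokens for sequence modeling; the class and field names (`DesktopEvent`, `to_token`, the `EventType` values) are illustrative assumptions, not the OWA Toolkit's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    """Illustrative event categories; the real toolkit's taxonomy may differ."""
    KEY_PRESS = "key_press"
    KEY_RELEASE = "key_release"
    MOUSE_MOVE = "mouse_move"
    MOUSE_CLICK = "mouse_click"


@dataclass(frozen=True)
class DesktopEvent:
    """One timestamped interaction event, the unit a timestamp-based IDM would predict."""
    timestamp_ns: int       # event time, aligned to the screen recording clock
    event_type: EventType
    payload: dict           # e.g. {"key": "w"} or {"x": 512, "y": 384}


def to_token(event: DesktopEvent) -> str:
    """Serialize an event into a flat string token for sequence modeling."""
    fields = ",".join(f"{k}={v}" for k, v in sorted(event.payload.items()))
    return f"<{event.timestamp_ns}|{event.event_type.value}|{fields}>"


events = [
    DesktopEvent(1_000_000, EventType.KEY_PRESS, {"key": "w"}),
    DesktopEvent(1_500_000, EventType.MOUSE_MOVE, {"x": 512, "y": 384}),
]
tokens = [to_token(e) for e in events]
```

Keeping absolute timestamps (rather than fixed frame indices) is what lets a single model handle recordings with heterogeneous frame rates and input devices.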
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robot manipulation | LIBERO | Goal Achievement | 98.6 | 700 |
| Robot manipulation | Meta-World | Success Rate (Easy) | 53.6 | 16 |
| Pick-&-place | SO101 | Pick & Place Success Rate | 80 | 6 |
| Video gameplay action prediction | Battlefield 6 | Pearson Correlation (X) | 57.36 | 4 |
| Video gameplay action prediction | Ogu Forest | Keypress Accuracy (Keyboard) | 27.97 | 4 |
| Robot navigation | CANVAS | Gallery Miss Rate | 33.3 | 3 |
| Action prediction | Brotato 2D in-distribution | Pearson Correlation (X) | 73.65 | 2 |
| Action prediction | Stardew Valley 2D in-distribution | Pearson Correlation (X) | 82.98 | 2 |
| Action prediction | Core Keeper 2D in-distribution | Pearson Correlation (X) | 77.25 | 2 |
| Inverse dynamics modeling | Apex Legends | Pearson Correlation (X) | 83.9 | 2 |
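Several rows above report "Pearson Correlation (X)": the linear correlation between predicted and ground-truth mouse X movement across a held-out gameplay clip. As a reference for how that number is computed, here is a minimal, dependency-free implementation of the Pearson coefficient (the `pearson` helper is our own illustration, not code from the D2E release).

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences.

    Returns a value in [-1, 1]; +1 means perfectly linearly aligned
    predictions, 0 means no linear relationship.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# A prediction that tracks the ground truth up to scale still scores 1.0,
# so the metric rewards directionally correct mouse deltas, not exact magnitudes.
r = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

Scale-invariance is why Pearson correlation is a natural fit for raw mouse deltas, whose absolute magnitude depends on per-game sensitivity settings.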