
D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI

About

Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive cost of collecting physical trajectories. Desktop environments, particularly gaming, offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework demonstrating that desktop interactions can serve as an effective pretraining substrate for embodied AI tasks. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept its data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. The framework comprises three components: (1) the OWA Toolkit, which unifies diverse desktop interactions into a standardized format with 152x compression; (2) the Generalist-IDM, which achieves strong zero-shot generalization to unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling; and (3) VAPT, which transfers desktop-pretrained representations to physical manipulation and navigation. Using 1.3K+ hours of data (259 hours of human demonstrations and 1K+ hours of pseudo-labeled gameplay), our 1B-parameter model achieves 96.6% success on LIBERO manipulation and 83.3% on CANVAS navigation, matching or surpassing models up to 7x larger, such as π0 (3.3B) and OpenVLA (7B). These results demonstrate that sensorimotor primitives learned from digital interactions transfer effectively to real-world physical tasks, establishing desktop pretraining as a practical paradigm for embodied AI. All resources are publicly available at https://worv-ai.github.io/d2e.
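The pipeline above hinges on pairing screen frames with precisely timestamped input events. As an illustration only (the class and field names below are hypothetical, not the OWA Toolkit's actual schema), a standardized desktop event stream, and the per-frame windowing an inverse dynamics model would consume, might look like this sketch:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DesktopEvent:
    """One timestamped input event (hypothetical schema, not the OWA format)."""
    t_ns: int     # event timestamp in nanoseconds
    device: str   # "keyboard" or "mouse"
    action: str   # e.g. "press:W", "release:W", "move:+12,-3"

def events_between(events: List[DesktopEvent], t0_ns: int, t1_ns: int) -> List[DesktopEvent]:
    """Collect the events that fall between two frame timestamps.
    Timestamp-based event prediction trains a model to recover exactly
    this set (events plus their timings) from the surrounding frames,
    which is what enables pseudo-labeling of unlabeled gameplay video."""
    return [e for e in events if t0_ns <= e.t_ns < t1_ns]

# Toy stream: two keyboard events and one mouse move within ~20 ms.
stream = [
    DesktopEvent(0, "keyboard", "press:W"),
    DesktopEvent(8_000_000, "mouse", "move:+12,-3"),
    DesktopEvent(20_000_000, "keyboard", "release:W"),
]
# Events landing inside the first 16 ms frame interval (at ~60 fps):
print([e.action for e in events_between(stream, 0, 16_000_000)])
```

The key design point this sketch reflects is that actions are kept as asynchronous timestamped events rather than resampled onto the frame grid, so no keypress or mouse motion is lost between frames.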

Suhwan Choi, Jaeyoon Jung, Haebin Seong, Minchan Kim, Minyeong Kim, Yongjun Cho, Yoonshik Kim, Yubeen Park, Youngjae Yu, Yunsung Lee • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Robot Manipulation | LIBERO | Goal Achievement | 98.6 | 700 |
| Robotic Manipulation | Meta-World | Success Rate (Easy) | 53.6 | 16 |
| Pick-&-Place | SO101 | Pick & Place Success Rate | 80 | 6 |
| Video gameplay action prediction | Battlefield 6 | Pearson Correlation (X) | 57.36 | 4 |
| Video gameplay action prediction | Ogu Forest | Keypress Accuracy (Keyboard) | 27.97 | 4 |
| Robot navigation | CANVAS | Gallery Miss Rate | 33.3 | 3 |
| Action Prediction | Brotato (2D, in-distribution) | Pearson Correlation (X) | 73.65 | 2 |
| Action Prediction | Stardew Valley (2D, in-distribution) | Pearson Correlation (X) | 82.98 | 2 |
| Action Prediction | Core Keeper (2D, in-distribution) | Pearson Correlation (X) | 77.25 | 2 |
| Inverse Dynamics Modeling | Apex Legends | Pearson Correlation (X) | 83.9 | 2 |

Showing 10 of 12 rows.
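Several rows above report "Pearson Correlation (X)": the linear correlation between predicted and ground-truth horizontal mouse displacement per step, scaled to a percentage. A minimal sketch of that metric, using illustrative data (the function name and the scaling convention are assumptions, not the paper's evaluation code):

```python
import numpy as np

def pearson_x(pred_dx: np.ndarray, true_dx: np.ndarray) -> float:
    """Pearson correlation between predicted and ground-truth per-step
    mouse X displacements, scaled to a percentage so a perfect linear
    fit scores 100 (the table's values, e.g. 83.9, read this way)."""
    return float(np.corrcoef(pred_dx, true_dx)[0, 1]) * 100.0

# Toy per-step displacements in pixels (hypothetical numbers).
true_dx = np.array([12.0, -3.0, 0.0, 7.0, -5.0])
pred_dx = np.array([10.0, -2.0, 1.0, 6.0, -4.0])
print(round(pearson_x(pred_dx, true_dx), 1))
```

Note that correlation rewards getting the direction and relative magnitude of mouse motion right, without penalizing a consistent scale offset, which makes it a forgiving but informative metric for continuous action prediction.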

Other info

GitHub
