DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames
About
We present Decentralized Distributed Proximal Policy Optimization (DD-PPO), a method for distributed reinforcement learning in resource-intensive simulated environments. DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever stale), making it conceptually simple and easy to implement. In our experiments on training virtual robots to navigate in Habitat-Sim, DD-PPO exhibits near-linear scaling -- achieving a speedup of 107x on 128 GPUs over a serial implementation. We leverage this scaling to train an agent for 2.5 billion steps of experience (the equivalent of 80 years of human experience) -- over 6 months of GPU-time training in under 3 days of wall-clock time with 64 GPUs. This massive-scale training not only sets the state of the art on the Habitat Autonomous Navigation Challenge 2019, but essentially solves the task -- near-perfect autonomous navigation in an unseen environment without access to a map, directly from an RGB-D camera and a GPS+Compass sensor. Fortuitously, error vs. computation exhibits a power-law-like distribution; thus, 90% of peak performance is obtained relatively early (at 100 million steps) and relatively cheaply (under 1 day with 8 GPUs). Finally, we show that the scene understanding and navigation policies learned can be transferred to other navigation tasks -- the analog of ImageNet pre-training + task-specific fine-tuning for embodied AI. Our model outperforms ImageNet pre-trained CNNs on these transfer tasks and can serve as a universal resource (all models and code are publicly available).
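The scaling figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below is not part of the DD-PPO code. The function names are hypothetical, the inputs are the numbers stated in the abstract, and the 30-day month and 365-day year are simplifying assumptions.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# Function names are illustrative; the 30-day month and 365-day year
# are assumptions made only for this sketch.


def scaling_efficiency(speedup: float, num_gpus: int) -> float:
    """Fraction of ideal linear speedup that was achieved."""
    return speedup / num_gpus


def gpu_months(num_gpus: int, wall_clock_days: float,
               days_per_month: float = 30.0) -> float:
    """Total GPU time, in months, for a run of the given wall-clock length."""
    return num_gpus * wall_clock_days / days_per_month


def steps_per_second(steps: float, years: float) -> float:
    """Simulation steps per second of the equivalent real-time experience."""
    seconds = years * 365 * 24 * 3600
    return steps / seconds


# 107x speedup on 128 GPUs.
efficiency = scaling_efficiency(107, 128)

# Under 3 days on 64 GPUs, so this is an upper bound on the GPU time.
budget = gpu_months(64, 3)

# 2.5 billion steps described as 80 years of experience.
rate = steps_per_second(2.5e9, 80)

print(f"scaling efficiency: {efficiency:.1%}")
print(f"GPU time (upper bound): {budget:.1f} months")
print(f"steps per second of experience: {rate:.2f}")
```

Under these assumptions the speedup corresponds to roughly 84% of ideal linear scaling, and 3 days on 64 GPUs comes to about 6.4 GPU-months, consistent with the "over 6 months" figure. The 80-year equivalence works out to about one agent step per second of experience.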
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| ObjectGoal Navigation | MP3D (val) | Success Rate | 8 | 68 |
| Object Goal Navigation | HM3D-OVON Seen (val) | SR | 39.2 | 44 |
| Object Goal Navigation | HM3D v1 (val) | Success Rate (SR) | 27.9 | 34 |
| ObjectNav (Label goal) | Gibson tiny (test) | Success Rate | 13.9 | 20 |
| ObjectNav | Gibson (val) | Success Rate | 15 | 18 |
| System Throughput Measurement | Embodied Rearrangement open-fridge (train) | Mean SPS | 1.07e+3 | 16 |
| Object Navigation | CoIN-Bench Seen Synonyms (val) | SPL | 11.7 | 13 |
| Image-Goal Navigation | Gibson Curved trajectories (unseen) | Succ (Easy) | 22.2 | 12 |
| Object Navigation | OVON unseen (val) | SR | 18.6 | 12 |
| ObjectGoal Navigation | MP3D (test-std) | Success Rate | 0.062 | 11 |