Vision Transformers for End-to-End Vision-Based Quadrotor Obstacle Avoidance
About
We demonstrate the capabilities of an attention-based end-to-end approach for high-speed vision-based quadrotor obstacle avoidance in dense, cluttered environments, with comparison to various state-of-the-art learning architectures. Quadrotor unmanned aerial vehicles (UAVs) have tremendous maneuverability when flown fast; however, as flight speed increases, traditional model-based approaches to navigation via independent perception, mapping, planning, and control modules break down due to increased sensor noise, compounding errors, and increased processing latency. Thus, learning-based, end-to-end vision-to-control networks have shown great potential for online control of these fast robots through cluttered environments. We train and compare convolutional, U-Net, and recurrent architectures against vision transformer (ViT) models for depth image-to-control in high-fidelity simulation. We observe that ViT models are more effective than the others as quadrotor speeds increase and in generalization to unseen environments, and that adding recurrence further improves performance while reducing quadrotor energy cost across all tested flight speeds. We assess performance at speeds of up to 7 m/s in simulation and hardware. To the best of our knowledge, this is the first work to utilize vision transformers for end-to-end vision-based quadrotor control.
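As a rough illustration of the "depth image-to-control" input stage, the sketch below shows the standard ViT tokenization step: a depth image is cut into non-overlapping square patches, and each patch is flattened into one token. This is a generic assumption about ViT-style models, not the authors' confirmed implementation; the function name `split_into_patches`, the toy 4x6 image, and the patch size are all hypothetical.

```python
# Illustrative only: generic ViT-style tokenization, not the paper's exact model.

def split_into_patches(depth_image, patch_size):
    """Split a 2D depth image (list of rows) into flattened, non-overlapping patches."""
    height = len(depth_image)
    width = len(depth_image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    tokens = []
    # Walk the image patch by patch, in row-major order.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                depth_image[top + r][left + c]
                for r in range(patch_size)
                for c in range(patch_size)
            ]
            tokens.append(patch)
    return tokens


# Hypothetical 4x6 depth image; real inputs would be much larger.
depth = [[float(r * 6 + c) for c in range(6)] for r in range(4)]
tokens = split_into_patches(depth, patch_size=2)
print(len(tokens), len(tokens[0]))  # 6 tokens, each holding 4 depth values
```

In a full model, each token would typically be linearly projected, passed through self-attention layers, and mapped to a control command; those stages are omitted here because the abstract does not specify them.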
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Hovering Maintenance | Urban Street | Success Rate (SR) | 0.00e+0 | 24 |
| Hovering Maintenance | Park | Success Rate (SR) | 0.00e+0 | 24 |
| Dynamic Target Following | Forest 1.5 m/s target speed | Success Rate (SR) | 0.00e+0 | 6 |
| Dynamic Target Following | Factory 3.0 m/s target speed | Success Rate (SR) | 0.00e+0 | 6 |
| Fixed-trajectory filming | Park scene 3.0 m/s obstacle speed | Success Rate (SR) | 0.00e+0 | 6 |
| Fixed-trajectory filming | Park scene 6.0 m/s obstacle speed | Success Rate (SR) | 0.00e+0 | 6 |
| Fixed-trajectory filming | Forest scene 3.0 m/s obstacle speed | Success Rate (SR) | 0.00e+0 | 6 |
| Fixed-trajectory filming | Forest scene 6.0 m/s obstacle speed | Success Rate (SR) | 0.00e+0 | 6 |
| Dynamic Target Following | Factory 1.5 m/s target speed | Success Rate (SR) | 0.00e+0 | 6 |
| Dynamic Target Following | Forest 3.0 m/s target speed | Success Rate (SR) | 0.00e+0 | 6 |