PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators

About

We present PoliFormer (Policy Transformer), an RGB-only indoor navigation agent trained end-to-end with reinforcement learning at scale that generalizes to the real world without adaptation despite being trained purely in simulation. PoliFormer uses a foundational vision transformer encoder with a causal transformer decoder, which enables long-term memory and reasoning. It is trained for hundreds of millions of interactions across diverse environments, leveraging parallelized, multi-machine rollouts for efficient, high-throughput training. PoliFormer is a masterful navigator, producing state-of-the-art results across two distinct embodiments, the LoCoBot and Stretch RE-1 robots, and four navigation benchmarks. It breaks through the plateaus of previous work, achieving an unprecedented 85.5% success rate in object goal navigation on the CHORES-S benchmark, a 28.5% absolute improvement. PoliFormer can also be trivially extended to a variety of downstream applications such as object tracking, multi-object navigation, and open-vocabulary navigation with no finetuning.
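
The role of the causal decoder can be illustrated with a minimal sketch of causally masked attention over a history of per-step features. This is not the PoliFormer implementation: the function name `causal_attention`, the single attention head, the absence of learned projections, and the toy feature vectors are all assumptions made for illustration. It only shows the property the abstract relies on, namely that the output at step t depends on steps 0..t and never on later steps.

```python
import math


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Hypothetical helper for illustration: each argument is a list of
    equal-length float vectors, one vector per timestep.
    """
    dim = len(queries[0])
    outputs = []
    for t, query in enumerate(queries):
        # Causal mask: step t may only attend to steps 0..t.
        scores = [
            sum(q * k for q, k in zip(query, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible steps.
        peak = max(scores)
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible value vectors.
        outputs.append([
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ])
    return outputs


# Toy history of three per-step feature vectors (made-up numbers).
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = causal_attention(history, history, history)
```

Because the first step can see only itself, `attended[0]` equals `history[0]`, and replacing the last entry of `history` leaves the first two outputs unchanged. That is what lets such a decoder act as a memory over past observations during a rollout.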

Kuo-Hao Zeng, Zichen Zhang, Kiana Ehsani, Rose Hendrix, Jordi Salvador, Alvaro Herrasti, Ross Girshick, Aniruddha Kembhavi, Luca Weihs (2024)

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Active visual tracking	EVT-Bench Single Target single-view	TR	15.5	11
Embodied Visual Tracking	EVT-Bench Single Target Tracking	SR	4.67	11
Embodied Visual Tracking	EVT-Bench Distracted Tracking	SR	2.62	11
Embodied Visual Tracking	EVT-Bench	ST Success Rate (SR)	4.7	10
Safety-ObjNav	Safety-CHORES (test)	Success Rate	80.4	9
Person-Following	EVT-Bench Single-Target Tracking (STT) single view	SR	4.67	9
Person-Following	EVT-Bench single view (Distracted Tracking)	SR	2.62	9
Person-Following	EVT-Bench Ambiguity Tracking (AT) single view	Success Rate (SR)	3.04	8

