Modular Framework for Visuomotor Language Grounding
About
Natural language instruction following tasks serve as a valuable test-bed for grounded language and robotics research. However, data collection for these tasks is expensive and end-to-end approaches suffer from data inefficiency. We propose structuring language, action, and vision tasks into separate modules that can be trained independently. Using a Language, Action, and Vision (LAV) framework removes the dependence of the action and vision modules on instruction following datasets, making them more efficient to train. We also present a preliminary evaluation of LAV on the ALFRED task for visual and interactive instruction following.
Kolby Nottingham, Litian Liang, Daeyun Shin, Charless C. Fowlkes, Roy Fox, Sameer Singh • 2021
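To make the modular decomposition concrete, the sketch below shows how separately trained language, vision, and action modules could compose at inference time: language maps an instruction to subgoals, vision maps observations to state, and action maps (subgoal, state) to low-level commands. All class and method names here (`LanguageModule.plan`, `VisionModule.perceive`, `ActionModule.act`) are hypothetical placeholders for illustration, not the paper's actual interfaces.

```python
# Minimal sketch of a modular LAV-style pipeline. Hypothetical interfaces;
# the paper's actual module designs and training objectives differ.
from dataclasses import dataclass
from typing import List


@dataclass
class Subgoal:
    verb: str    # e.g. "pick"
    target: str  # e.g. "apple"


class LanguageModule:
    """Maps an instruction to a sequence of subgoals.
    Trainable on language-only data, with no environment interaction."""
    def plan(self, instruction: str) -> List[Subgoal]:
        # Toy heuristic standing in for a learned model.
        words = instruction.lower().split()
        return [Subgoal(verb=words[0], target=words[-1])]


class VisionModule:
    """Maps a raw observation to a state representation.
    Trainable on vision-only data (e.g. object detection)."""
    def perceive(self, frame: bytes) -> dict:
        return {"visible_objects": ["apple", "table"]}  # placeholder output


class ActionModule:
    """Maps (subgoal, state) to a low-level action.
    Trainable in the environment without language annotations."""
    def act(self, subgoal: Subgoal, state: dict) -> str:
        if subgoal.target in state["visible_objects"]:
            return f"{subgoal.verb}({subgoal.target})"
        return "explore()"


def run(instruction: str, frame: bytes) -> List[str]:
    """Modules are composed only here, at inference time;
    each was trained independently on its own data source."""
    lang, vision, action = LanguageModule(), VisionModule(), ActionModule()
    state = vision.perceive(frame)
    return [action.act(sg, state) for sg in lang.plan(instruction)]


print(run("pick up the apple", b""))  # -> ['pick(apple)']
```

The design point this illustrates is that only the narrow interfaces between modules (subgoals and state representations) need to agree, so the action and vision modules never touch instruction following data.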
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Instruction Following | ALFRED (test-unseen) | GC | 17.27 | 23 |
| Embodied Instruction Following | ALFRED seen 1.0 (test) | GC | 23.21 | 20 |
| Embodied Task Completion | ALFRED unseen (test) | Success Rate | 6.38 | 14 |