MAIN-VLA: Modeling Abstraction of Intention and eNvironment for Vision-Language-Action Models
About
Despite significant progress in Vision-Language-Action (VLA) models, existing approaches remain inefficient at extracting action-critical signals from redundant sensor streams in highly complex, dynamic environments that involve unpredictable real-time interactions, such as 3D open worlds and large-scale PvP games. To tackle this, we introduce MAIN-VLA, a framework that explicitly Models the Abstraction of Intention and eNvironment to ground decision-making in deep semantic alignment rather than superficial pattern matching. Specifically, our Intention Abstraction (IA) distills verbose linguistic instructions and their associated reasoning into compact, explicit semantic primitives, while the Environment Semantics Abstraction (ESA) projects overwhelming visual streams into a structured, topological affordance representation. Furthermore, aligning these two abstract modalities induces an emergent attention-concentration effect, enabling a parameter-free token-pruning strategy that filters out perceptual redundancy without degrading performance. Extensive experiments in open-world Minecraft and large-scale PvP environments (Game for Peace and Valorant) demonstrate that MAIN-VLA sets a new state of the art, achieving superior decision quality, stronger generalization, and cutting-edge inference efficiency.
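The abstract does not spell out the pruning rule, so the following is only a minimal sketch of one way attention-guided token pruning without learned parameters could look. It assumes access to a matrix of attention weights from intention tokens to visual tokens; the function name `prune_visual_tokens`, the `keep_mass` threshold, and the cumulative-mass criterion are all hypothetical choices for illustration, not the method described by the authors.

```python
import numpy as np

def prune_visual_tokens(attn, keep_mass=0.9):
    """Return sorted indices of visual tokens to keep.

    attn: array of shape (num_intention_tokens, num_visual_tokens) holding
          non-negative attention weights (hypothetical input).
    keep_mass: fraction of total attention mass to preserve (assumed knob;
               it is a fixed threshold, not a learned parameter).
    """
    attn = np.asarray(attn, dtype=float)
    # Average over intention tokens to get one score per visual token.
    scores = attn.mean(axis=0)
    scores = scores / scores.sum()
    # Rank visual tokens from most to least attended.
    order = np.argsort(-scores, kind="stable")
    cumulative = np.cumsum(scores[order])
    # Smallest prefix whose cumulative mass reaches keep_mass.
    k = int(np.searchsorted(cumulative, keep_mass)) + 1
    k = min(k, scores.size)
    # Restore the original token order for the retained tokens.
    return np.sort(order[:k])

# Two intention tokens attending over four visual tokens.
attn = [[0.7, 0.1, 0.1, 0.1],
        [0.5, 0.3, 0.1, 0.1]]
kept = prune_visual_tokens(attn, keep_mass=0.75)
```

The more concentrated the attention, the fewer tokens are needed to reach the same mass, which is the intuition behind pruning on top of an attention-concentration effect.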
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Combat Tasks | MCU Mini | Success Rate | 49.3 | 6 |
| Combat Tasks | MCU All set | Steps | 248 | 6 |
| Embodied decision-making | Game for Peace 1.0 (test) | Parachuting Score | 71.4 | 6 |
| Embodied Tasks | MCU Mini | SR | 38.5 | 6 |
| Embodied Tasks | MCU All set | Steps | 263 | 6 |
| GUI Tasks | MCU Mini set | Success Rate | 3.67e+3 | 5 |
| GUI Tasks | MCU All set | Steps | 291 | 5 |
| Embodied AI Gaming | Valorant (test) | Success Rate | 0.628 | 3 |