MAIN-VLA: Modeling Abstraction of Intention and eNvironment for Vision-Language-Action Models

About

Despite significant progress in Vision-Language-Action (VLA) models, existing approaches remain inefficient at extracting action-critical signals from redundant sensor streams in highly complex, dynamic environments that involve unpredictable real-time interactions, such as 3D open worlds and large-scale PvP games. To tackle this, we introduce MAIN-VLA, a framework that explicitly Models the Abstraction of Intention and eNvironment to ground decision-making in deep semantic alignment rather than superficial pattern matching. Specifically, our Intention Abstraction (IA) distills verbose linguistic instructions and their associated reasoning into compact, explicit semantic primitives, while the Environment Semantics Abstraction (ESA) projects overwhelming visual streams into a structured, topological affordance representation. Furthermore, aligning these two abstract modalities induces an emergent attention-concentration effect, enabling a parameter-free token-pruning strategy that filters out perceptual redundancy without degrading performance. Extensive experiments in open-world Minecraft and large-scale PvP environments (Game for Peace and Valorant) demonstrate that MAIN-VLA sets a new state of the art, achieving superior decision quality, stronger generalization, and cutting-edge inference efficiency.

Zheyuan Zhou, Liang Du, Zixun Sun, Xiaoyu Zhou, Ruimin Ye, Qihao Chen, Yinda Chen, Lemiao Qiu • 2026
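
The token-pruning idea in the abstract can be illustrated with a small sketch. The abstract does not specify the pruning rule, so everything below is an assumption made for illustration: intention tokens act as queries, visual tokens act as keys, and the visual tokens that together receive most of the attention mass are kept. The function names (`attention_mass`, `prune_visual_tokens`) and the `keep_mass` threshold are hypothetical and do not come from the paper. The sketch adds no trainable weights, which is one plausible reading of "parameter-free", although `keep_mass` is still a hand-set hyperparameter here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_mass(intent_tokens, visual_tokens):
    """Average attention each visual token receives from the intent tokens.

    Assumption: plain scaled dot-product attention with no learned projections.
    """
    scale = math.sqrt(len(visual_tokens[0]))
    mass = [0.0] * len(visual_tokens)
    for query in intent_tokens:
        weights = softmax([dot(query, key) / scale for key in visual_tokens])
        for i, w in enumerate(weights):
            mass[i] += w / len(intent_tokens)
    return mass


def prune_visual_tokens(intent_tokens, visual_tokens, keep_mass=0.9):
    """Return indices of the smallest token set covering `keep_mass` attention.

    `keep_mass` is a hypothetical threshold chosen for this sketch.
    """
    mass = attention_mass(intent_tokens, visual_tokens)
    order = sorted(range(len(mass)), key=lambda i: mass[i], reverse=True)
    kept, covered = [], 0.0
    for i in order:
        kept.append(i)
        covered += mass[i]
        if covered >= keep_mass:
            break
    return sorted(kept)


if __name__ == "__main__":
    # Toy 2-D embeddings: one intent token, four visual tokens.
    intent = [[4.0, 0.0]]
    visual = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0], [3.5, 0.0]]
    print(prune_visual_tokens(intent, visual))
```

When attention is concentrated on a few visual tokens, as the abstract reports after alignment, a rule of this kind keeps only a handful of tokens; when attention is diffuse, it keeps most of them.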

Related benchmarks

Task	Dataset	Metric	Result
Combat Tasks	MCU All set	Steps	248	15
Embodied Tasks	MCU All set	Steps	263	15
GUI Tasks	Minecraft MCU	ASR	34.4	9
Combat Tasks	MCU Mini	Success Rate	49.3	6
Embodied decision-making	Game for Peace 1.0 (test)	Parachuting Score	71.4	6
Embodied Tasks	MCU Mini	SR	38.5	6
GUI Tasks	MCU Mini set	Success Rate	3.67e+3	5
GUI Tasks	MCU All set	Steps	291	5
Embodied AI Gaming	Valorant (test)	Success Rate	0.628	3

