MAIN-VLA: Modeling Abstraction of Intention and eNvironment for Vision-Language-Action Models

About

Despite significant progress in Vision-Language-Action (VLA) models, existing approaches remain inefficient at extracting action-critical signals from redundant sensor streams in highly complex, dynamic environments that involve unpredictable real-time interactions, such as 3D open worlds and large-scale PvP games. To address this, we introduce MAIN-VLA, a framework that explicitly Models the Abstraction of Intention and eNvironment to ground decision-making in deep semantic alignment rather than superficial pattern matching. Specifically, Intention Abstraction (IA) distills verbose linguistic instructions and their associated reasoning into compact, explicit semantic primitives, while Environment Semantics Abstraction (ESA) projects overwhelming visual streams into a structured, topological affordance representation. Furthermore, aligning these two abstract modalities induces an emergent attention-concentration effect, which enables a parameter-free token-pruning strategy that filters out perceptual redundancy without degrading performance. Extensive experiments in open-world Minecraft and large-scale PvP environments (Game for Peace and Valorant) demonstrate that MAIN-VLA sets a new state of the art, achieving superior decision quality, stronger generalization, and cutting-edge inference efficiency.
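
The abstract does not say how the token pruning is computed, so the following is only a minimal sketch of one way attention-based pruning without learned parameters could look. It is not the authors' method. The function name `prune_visual_tokens`, the attention layout (intention tokens as rows, visual tokens as columns), and the fixed `keep_mass` threshold are all assumptions made for illustration.

```python
from typing import List, Sequence


def prune_visual_tokens(
    attention: Sequence[Sequence[float]],
    keep_mass: float = 0.9,
) -> List[int]:
    """Return indices of visual tokens to keep, in their original order.

    attention[i][j] is the (hypothetical) attention weight that intention
    token i places on visual token j. keep_mass is an assumed fixed
    threshold, not a learned parameter.
    """
    if not attention or not attention[0]:
        return []
    num_visual = len(attention[0])
    # Average attention each visual token receives across intention tokens.
    scores = [
        sum(row[j] for row in attention) / len(attention)
        for j in range(num_visual)
    ]
    total = sum(scores)
    if total <= 0:
        return list(range(num_visual))  # nothing to rank on, keep everything
    # Visit tokens from most to least attended and keep them until the
    # retained share of the attention mass reaches keep_mass.
    ranked = sorted(range(num_visual), key=lambda j: scores[j], reverse=True)
    kept: List[int] = []
    covered = 0.0
    for j in ranked:
        kept.append(j)
        covered += scores[j] / total
        if covered >= keep_mass:
            break
    return sorted(kept)


# Two intention tokens attending over five visual tokens (made-up numbers).
attn = [
    [0.70, 0.05, 0.15, 0.05, 0.05],
    [0.10, 0.05, 0.75, 0.05, 0.05],
]
kept = prune_visual_tokens(attn, keep_mass=0.8)
print(kept)  # [0, 2]
```

In this toy input the attention is concentrated on visual tokens 0 and 2, so three of the five tokens are dropped. The more concentrated the attention, the fewer tokens survive at the same threshold, which is the intuition behind pruning on top of an attention-concentration effect.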

Zheyuan Zhou, Liang Du, Zixun Sun, Xiaoyu Zhou, Ruimin Ye, Qihao Chen, Yinda Chen, Lemiao Qiu • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Combat Tasks | MCU Mini | Success Rate | 49.3 | 6 |
| Combat Tasks | MCU All set | Steps | 248 | 6 |
| Embodied decision-making | Game for Peace 1.0 (test) | Parachuting Score | 71.4 | 6 |
| Embodied Tasks | MCU Mini | SR | 38.5 | 6 |
| Embodied Tasks | MCU All set | Steps | 263 | 6 |
| GUI Tasks | MCU Mini set | Success Rate | 3.67e+3 | 5 |
| GUI Tasks | MCU All set | Steps | 291 | 5 |
| Embodied AI Gaming | Valorant (test) | Success Rate | 0.628 | 3 |