
MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation

About

Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine-tuning, or prompt tuning, and often operate in an open-loop manner without robust environmental feedback, making them fragile in dynamic settings. MALLVI is a Multi-Agent Large Language and Vision framework that enables closed-loop, feedback-driven robotic manipulation. Given a natural-language instruction and an image of the environment, MALLVI generates executable atomic actions for a robot manipulator. After each action is executed, a Vision-Language Model (VLM) evaluates environmental feedback and decides whether to repeat the step or proceed to the next one. Rather than relying on a single model, MALLVI coordinates specialized agents (Decomposer, Localizer, Thinker, and Reflector) to manage perception, localization, reasoning, and high-level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only the relevant agents, avoiding full replanning. Experiments in simulated and real-world settings show that iterative, closed-loop multi-agent coordination improves generalization and increases success rates on zero-shot manipulation tasks. Code is available at https://github.com/iman1234ahmadi/MALLVI .
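The execute-observe-decide cycle described above can be sketched in a few lines of Python. This is a minimal illustration of the control flow only: the agent classes, method names, and the string-based "success check" below are hypothetical stand-ins, not the actual MALLVI API (which is in the linked repository).

```python
from dataclasses import dataclass


@dataclass
class StubAgent:
    """Placeholder for an LLM/VLM-backed agent (Decomposer, Localizer, ...).

    A real agent would query a model; this stub just echoes its prompt so the
    closed-loop control flow can be followed end to end.
    """
    name: str
    calls: int = 0

    def ask(self, prompt: str) -> str:
        self.calls += 1
        return f"[{self.name}] {prompt}"


def run_closed_loop(instruction: str, image: str, max_attempts: int = 3):
    """Illustrative closed loop: plan an atomic action, execute it, then let
    a VLM-style Reflector decide whether to retry the step or move on."""
    decomposer = StubAgent("Decomposer")
    localizer = StubAgent("Localizer")
    thinker = StubAgent("Thinker")
    reflector = StubAgent("Reflector")

    # 1. Decomposer: instruction -> ordered atomic sub-tasks (stubbed here).
    decomposer.ask(instruction)
    subtasks = ["pick", "place"]  # stand-in decomposition

    trace = []
    for task in subtasks:
        for attempt in range(1, max_attempts + 1):
            # 2. Localizer grounds the target object in the current image.
            where = localizer.ask(f"locate target for {task} in {image}")
            # 3. Thinker proposes one executable atomic action.
            action = thinker.ask(f"atomic action for {task} at {where}")
            # 4. Stand-in for robot execution plus a fresh camera observation.
            obs = f"observation after {action}"
            # 5. Reflector inspects the new observation and decides whether
            #    to repeat this step or proceed to the next sub-task.
            verdict = reflector.ask(f"did {task} succeed given {obs}?")
            success = "succeed" in verdict  # stand-in success criterion
            if success:
                trace.append((task, attempt))
                break
    return trace
```

On failure, only the loop for the current sub-task re-runs, mirroring the paper's point that the Reflector reactivates relevant agents instead of triggering a full replan.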

Iman Ahmadi, Mehrshad Taji, Arad Mahdinezhad Kashani, AmirHossein Jadidi, Saina Kashani, Babak Khalaj • 2026

Related benchmarks

Task                 | Dataset                                   | Result                              | Rank
Robotic Manipulation | RLBench                                   | -                                   | 56
Robotic Manipulation | 8 Real-world Tasks, 20 repetitions (test) | Place Food Success Rate: 100        | 6
Robotic Manipulation | VIMABench                                 | Simple Manipulation Score: 100      | 6
Robot Manipulation   | RLBench                                   | Basketball in Hoop Success Rate: 89 | 4
