JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models

About

Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents. Existing approaches can handle certain long-horizon tasks in an open world. However, they still struggle when the number of open-world tasks is potentially unbounded, and they lack the ability to progressively improve task completion as game time progresses. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, we develop JARVIS-1 on top of pre-trained multimodal language models, which map visual observations and textual instructions to plans. The plans are ultimately dispatched to goal-conditioned controllers. We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual game survival experiences. JARVIS-1 is the most general Minecraft agent to date, capable of completing over 200 different tasks using a control and observation space similar to that of humans. These tasks range from short-horizon tasks, e.g., "chopping trees", to long-horizon tasks, e.g., "obtaining a diamond pickaxe". JARVIS-1 performs exceptionally well on short-horizon tasks, achieving nearly perfect performance. On the classic long-horizon task $\texttt{ObtainDiamondPickaxe}$, JARVIS-1 surpasses the reliability of current state-of-the-art agents by a factor of 5 and can successfully complete longer-horizon and more challenging tasks. The project page is available at https://craftjarvis.org/JARVIS-1

Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, Yitao Liang • 2023
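
The pipeline described in the abstract (a planner that maps an observation and an instruction to a plan, goal-conditioned controllers that execute each plan step, and a memory that feeds past experience back into planning) can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`Memory`, `plan_with_model`, `run_controller`, `run_episode`), the word-overlap retrieval rule, and the stubbed planner and controller are assumptions made purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    task: str
    plan: list[str]
    succeeded: bool


@dataclass
class Memory:
    """Toy stand-in for the multimodal memory: stores past plans by task text."""
    entries: list[MemoryEntry] = field(default_factory=list)

    def retrieve(self, task: str) -> list[MemoryEntry]:
        # Hypothetical retrieval rule: successful entries sharing a word with the task.
        words = set(task.lower().split())
        return [e for e in self.entries
                if e.succeeded and words & set(e.task.lower().split())]

    def store(self, task: str, plan: list[str], succeeded: bool) -> None:
        self.entries.append(MemoryEntry(task, list(plan), succeeded))


def plan_with_model(task: str, observation: dict, examples: list[MemoryEntry]) -> list[str]:
    """Placeholder for the pre-trained multimodal language model (assumed interface)."""
    if examples:
        return list(examples[0].plan)      # reuse a remembered successful plan
    return [f"explore for {task}", task]   # fallback when memory has nothing relevant


def run_controller(goal: str, observation: dict) -> bool:
    """Placeholder goal-conditioned controller; always reports success in this sketch."""
    return True


def run_episode(task: str, observation: dict, memory: Memory) -> tuple[list[str], bool]:
    examples = memory.retrieve(task)                      # 1. consult memory
    plan = plan_with_model(task, observation, examples)   # 2. generate a plan
    succeeded = all(run_controller(g, observation) for g in plan)  # 3. execute each goal
    memory.store(task, plan, succeeded)                   # 4. record the experience
    return plan, succeeded
```

In this sketch, a second call to `run_episode` with a related task reuses the plan stored by the first call, which is the sense in which memory lets planning draw on earlier experience. A real system would replace both stubs with learned models and use a much richer retrieval scheme.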

Related benchmarks

Task	Dataset	Result
Embodied Agent Task Completion	MCU task suite	Wooden Success Rate (SR) 93.57	17
Long-horizon Task Execution	Minecraft Long-horizon Tasks	Wood 93	15
Multi-step dependency reasoning	Minecraft	WeaponSet Success@0→10 37.5	11
Functionally equivalent reasoning	Minecraft	BridgeEq Success@0→10 47.5	11
Structural and shape-based recipe transfer	Minecraft	Bed Success Rate (0→10 steps) 60	11
Short-horizon dependency-based functional block utilization	Minecraft	CraftGrid Success@0→10 52.5	11
Sequential Milestone Success Rate	Minecraft Obtain Diamond task	Log Success Rate 97	8
Long-horizon tasks	Minecraft Iron	Success Rate (SR) 42.38	7
Long-horizon tasks	Minecraft Gold	Success Rate (SR) 8.84	7
Long-horizon tasks	Minecraft Diamond	Success Rate (SR) 7.69	7

Showing 10 of 13 rows
