Test-Time Mixture of World Models for Embodied Agents in Dynamic Environments

About

Language model (LM)-based embodied agents are increasingly deployed in real-world settings. Yet, their adaptability remains limited in dynamic environments, where constructing accurate and flexible world models is crucial for effective reasoning and decision-making. To address this challenge, we extend the Mixture-of-Experts (MoE) paradigm to embodied agents. While conventional MoE architectures modularize knowledge into expert components with pre-trained routing, they remain rigid once deployed, making them less effective for adapting to unseen domains in dynamic environments. We therefore propose Test-time Mixture of World Models (TMoW), a framework that enhances adaptability to unseen and evolving domains. TMoW updates its routing function over world models at test time, unlike conventional MoE where the function remains fixed, enabling agents to recombine existing models and integrate new ones for continual adaptation. It achieves this through (i) multi-granular prototype-based routing, which adapts mixtures across object- to scene-level similarities, (ii) test-time refinement that aligns unseen domain features with prototypes during inference, and (iii) distilled mixture-based augmentation, which efficiently constructs new models from few-shot data and existing prototypes. We evaluate TMoW on VirtualHome, ALFWorld, and RLBench benchmarks, demonstrating strong performance in both zero-shot adaptation and few-shot expansion scenarios, and showing that it enables embodied agents to operate effectively in dynamic environments.
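The routing idea in the abstract can be pictured with a small toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: cosine similarity, temperature softmax, averaging routing weights across granularities, and the moving-average prototype update are all stand-ins chosen for illustration, and every function name here is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(feature, prototypes, temperature=0.1):
    """Mixture weights over world models from prototype similarity (assumed form)."""
    return softmax([cosine(feature, p) for p in prototypes], temperature)

def multi_granular_route(features, prototypes, temperature=0.1):
    """Average the routing weights computed at each granularity (assumption).

    features:   {"object": vec, "scene": vec}
    prototypes: {"object": [vec per world model], "scene": [vec per world model]}
    """
    per_level = [route(features[g], prototypes[g], temperature) for g in features]
    n_models = len(per_level[0])
    return [sum(ws[k] for ws in per_level) / len(per_level) for k in range(n_models)]

def refine_prototypes(feature, prototypes, weights, lr=0.1):
    """Test-time step (assumed): pull each prototype toward the observed
    feature, scaled by that model's routing weight."""
    return [
        [p_i + lr * w * (f_i - p_i) for p_i, f_i in zip(p, feature)]
        for p, w in zip(prototypes, weights)
    ]

def mix(outputs, weights):
    """Weighted sum of per-model output vectors."""
    dim = len(outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, outputs)) for i in range(dim)]

# Toy usage: two world models, two granularities, 2-D features.
prototypes = {"object": [[1.0, 0.0], [0.0, 1.0]], "scene": [[1.0, 0.0], [0.0, 1.0]]}
features = {"object": [0.9, 0.1], "scene": [0.8, 0.2]}
weights = multi_granular_route(features, prototypes)
prototypes["object"] = refine_prototypes(features["object"], prototypes["object"], weights)
```

In this toy setup the first world model receives most of the mixture weight because both the object-level and scene-level features lie closer to its prototypes, and the refinement step nudges the object-level prototypes toward the feature just observed, with no gradient update to the world models themselves.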

Jinwoo Jang, Minjong Yoo, Sihyung Yoon, Honguk Woo • 2026

Related benchmarks

Task	Dataset	Result
Embodied Task Planning	VirtualHome (Seen)	--	18
Instruction Execution	VirtualHome (unseen domains)	Success Rate 83.61	15
Embodied Task Planning	ALFWorld (seen domains)	Success Rate (SR) 72.05	6
Embodied Task Planning	RLBench Seen domains	Success Rate 71.89	6
Embodied Task Planning	VirtualHome (unseen domains)	Success Rate 80.16	6
Embodied Task Planning	ALFWorld (unseen domains)	Success Rate (SR) 68.83	6
Embodied Task Planning	RLBench Unseen domains	Success Rate 62.75	6
Few-shot task expansion	VirtualHome unseen domains 1-shot	SR 81.56	5
Few-shot task expansion	VirtualHome unseen domains 5-shot	Success Rate 83.61	5
Few-shot task expansion	VirtualHome average performance (unseen domains)	SR 82.59	5

Showing 10 of 16 rows
