Test-Time Mixture of World Models for Embodied Agents in Dynamic Environments

About

Language model (LM)-based embodied agents are increasingly deployed in real-world settings. Yet, their adaptability remains limited in dynamic environments, where constructing accurate and flexible world models is crucial for effective reasoning and decision-making. To address this challenge, we extend the Mixture-of-Experts (MoE) paradigm to embodied agents. While conventional MoE architectures modularize knowledge into expert components with pre-trained routing, they remain rigid once deployed, making them less effective for adapting to unseen domains in dynamic environments. We therefore propose Test-time Mixture of World Models (TMoW), a framework that enhances adaptability to unseen and evolving domains. TMoW updates its routing function over world models at test time, unlike conventional MoE where the function remains fixed, enabling agents to recombine existing models and integrate new ones for continual adaptation. It achieves this through (i) multi-granular prototype-based routing, which adapts mixtures across object- to scene-level similarities, (ii) test-time refinement that aligns unseen domain features with prototypes during inference, and (iii) distilled mixture-based augmentation, which efficiently constructs new models from few-shot data and existing prototypes. We evaluate TMoW on VirtualHome, ALFWorld, and RLBench benchmarks, demonstrating strong performance in both zero-shot adaptation and few-shot expansion scenarios, and showing that it enables embodied agents to operate effectively in dynamic environments.
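The routing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`route`, `refine`, `mix`), the cosine-similarity score, the softmax temperature, and the update rule are all assumptions chosen for illustration, and the sketch uses a single feature granularity rather than the object- to scene-level hierarchy described above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def route(feature, prototypes, temperature=0.1):
    """Hypothetical router: softmax over feature-to-prototype similarities."""
    sims = [cosine(feature, p) for p in prototypes]
    m = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in sims]
    z = sum(exps)
    return [e / z for e in exps]

def refine(prototypes, feature, weights, lr=0.1):
    """Hypothetical test-time step: pull each prototype toward the observed
    feature, scaled by that prototype's routing weight."""
    return [
        [p_i + lr * w * (f_i - p_i) for p_i, f_i in zip(p, feature)]
        for p, w in zip(prototypes, weights)
    ]

def mix(outputs, weights):
    """Weighted combination of per-model output vectors."""
    dim = len(outputs[0])
    return [sum(w * o[i] for w, o in zip(weights, outputs)) for i in range(dim)]

# Two world models, each represented by one prototype (assumed toy values).
prototypes = [[1.0, 0.0], [0.0, 1.0]]
feature = [0.8, 0.6]  # feature observed in an unseen domain

weights = route(feature, prototypes)
prototypes = refine(prototypes, feature, weights)
prediction = mix([[1.0, 0.0], [0.0, 1.0]], weights)
```

In this toy setup the router is not retrained. Only the prototypes move at inference time, so the mixture over existing models can shift as features from a new domain arrive.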

Jinwoo Jang, Minjong Yoo, Sihyung Yoon, Honguk Woo • 2026

Related benchmarks

Task | Dataset | Result | Rank
Instruction Execution | VirtualHome (unseen domains) | Success Rate 83.61 | 15
Embodied Task Planning | VirtualHome (Seen) | Simple Success 83.61 | 10
Embodied Task Planning | ALFWorld (seen domains) | Success Rate (SR) 72.05 | 6
Embodied Task Planning | RLBench Seen domains | Success Rate 71.89 | 6
Embodied Task Planning | VirtualHome (unseen domains) | Success Rate 80.16 | 6
Embodied Task Planning | ALFWorld (unseen domains) | Success Rate (SR) 68.83 | 6
Embodied Task Planning | RLBench Unseen domains | Success Rate 62.75 | 6
Few-shot task expansion | VirtualHome unseen domains 1-shot | SR 81.56 | 5
Few-shot task expansion | VirtualHome unseen domains 5-shot | Success Rate 83.61 | 5
Few-shot task expansion | VirtualHome average performance (unseen domains) | SR 82.59 | 5
Showing 10 of 16 rows
