Test-Time Mixture of World Models for Embodied Agents in Dynamic Environments
About
Language model (LM)-based embodied agents are increasingly deployed in real-world settings. Yet, their adaptability remains limited in dynamic environments, where constructing accurate and flexible world models is crucial for effective reasoning and decision-making. To address this challenge, we extend the Mixture-of-Experts (MoE) paradigm to embodied agents. While conventional MoE architectures modularize knowledge into expert components with pre-trained routing, they remain rigid once deployed, making them less effective for adapting to unseen domains in dynamic environments. We therefore propose Test-time Mixture of World Models (TMoW), a framework that enhances adaptability to unseen and evolving domains. TMoW updates its routing function over world models at test time, unlike conventional MoE where the function remains fixed, enabling agents to recombine existing models and integrate new ones for continual adaptation. It achieves this through (i) multi-granular prototype-based routing, which adapts mixtures across object- to scene-level similarities, (ii) test-time refinement that aligns unseen domain features with prototypes during inference, and (iii) distilled mixture-based augmentation, which efficiently constructs new models from few-shot data and existing prototypes. We evaluate TMoW on VirtualHome, ALFWorld, and RLBench benchmarks, demonstrating strong performance in both zero-shot adaptation and few-shot expansion scenarios, and showing that it enables embodied agents to operate effectively in dynamic environments.
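As a rough illustration of components (i) and (ii), the sketch below routes over world models by comparing input features with per-model prototypes at two granularities, then nudges the prototypes toward the observed features at test time. It is not the authors' implementation. The function names (`route`, `refine`), the `"object"` and `"scene"` level keys, the cosine-similarity scoring, the softmax temperature, and the moving-average update are all assumptions chosen for illustration, since the abstract does not specify these details. Component (iii) is omitted.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def softmax(scores, temperature=0.1):
    """Numerically stable softmax; the temperature value is an assumed hyperparameter."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(features, prototypes, temperature=0.1):
    """Return one mixture weight per world model.

    features:   dict mapping a granularity level to a feature vector
    prototypes: list of dicts with the same keys, one dict per world model
    Each model's score is its mean similarity across levels (an assumption).
    """
    scores = []
    for proto in prototypes:
        sims = [cosine(features[level], proto[level]) for level in features]
        scores.append(sum(sims) / len(sims))
    return softmax(scores, temperature)

def refine(features, prototypes, weights, step=0.1):
    """Hypothetical test-time update: move each prototype toward the observed
    features, scaled by that model's routing weight. Returns new prototypes."""
    refined = []
    for proto, w in zip(prototypes, weights):
        rate = step * w
        refined.append({
            level: [(1 - rate) * p + rate * f
                    for p, f in zip(vec, features[level])]
            for level, vec in proto.items()
        })
    return refined

# Toy usage with two world models and 2-D features (made-up numbers).
features = {"object": [1.0, 0.0], "scene": [0.0, 1.0]}
prototypes = [
    {"object": [0.9, 0.1], "scene": [0.1, 0.9]},  # close to the input
    {"object": [0.0, 1.0], "scene": [1.0, 0.0]},  # far from the input
]
weights = route(features, prototypes)
prototypes = refine(features, prototypes, weights)
print(weights)
```

In this toy setup the first world model receives nearly all of the mixture weight, and its prototypes drift slightly toward the input after refinement, while the routing rule itself stays training-free.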
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Instruction Execution | VirtualHome (unseen domains) | Success Rate | 83.61 | 15 |
| Embodied Task Planning | VirtualHome (Seen) | Simple Success | 83.61 | 10 |
| Embodied Task Planning | ALFWorld (seen domains) | Success Rate (SR) | 72.05 | 6 |
| Embodied Task Planning | RLBench Seen domains | Success Rate | 71.89 | 6 |
| Embodied Task Planning | VirtualHome (unseen domains) | Success Rate | 80.16 | 6 |
| Embodied Task Planning | ALFWorld (unseen domains) | Success Rate (SR) | 68.83 | 6 |
| Embodied Task Planning | RLBench Unseen domains | Success Rate | 62.75 | 6 |
| Few-shot task expansion | VirtualHome unseen domains 1-shot | SR | 81.56 | 5 |
| Few-shot task expansion | VirtualHome unseen domains 5-shot | Success Rate | 83.61 | 5 |
| Few-shot task expansion | VirtualHome average performance (unseen domains) | SR | 82.59 | 5 |