TD-MPC2: Scalable, Robust World Models for Continuous Control
About
TD-MPC is a model-based reinforcement learning (RL) algorithm that performs local trajectory optimization in the latent space of a learned implicit (decoder-free) world model. In this work, we present TD-MPC2: a series of improvements upon the TD-MPC algorithm. We demonstrate that TD-MPC2 improves significantly over baselines across 104 online RL tasks spanning 4 diverse task domains, achieving consistently strong results with a single set of hyperparameters. We further show that agent capabilities increase with model and data size, and successfully train a single 317M-parameter agent to perform 80 tasks across multiple task domains, embodiments, and action spaces. We conclude with an account of lessons, opportunities, and risks associated with large TD-MPC2 agents. Explore videos, models, data, code, and more at https://tdmpc2.com
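To make "local trajectory optimization in the latent space" concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy one-dimensional latent state and hypothetical `dynamics`, `reward`, and `terminal_value` functions standing in for learned networks, and it uses a simple cross-entropy-style sampler as a stand-in for the planner used in the paper. The idea it illustrates: sample short action sequences, roll them out through the latent model only (no observation decoding), score them by predicted reward plus a terminal value estimate, refit the sampling distribution to the best sequences, and execute only the first action.

```python
import random

# All three functions below are hypothetical stand-ins for learned networks.
def dynamics(z, a):
    # Toy latent dynamics: the action nudges the latent state.
    return z + 0.1 * a

def reward(z, a):
    # Toy reward: stay close to z = 1 while using small actions.
    return -(z - 1.0) ** 2 - 0.01 * a ** 2

def terminal_value(z):
    # Stand-in for a learned value estimate at the end of the horizon.
    return -(z - 1.0) ** 2

def rollout_return(z0, actions, gamma=0.99):
    """Discounted return of an action sequence, predicted in latent space."""
    z, total, discount = z0, 0.0, 1.0
    for a in actions:
        total += discount * reward(z, a)
        z = dynamics(z, a)
        discount *= gamma
    return total + discount * terminal_value(z)

def plan(z0, horizon=5, samples=256, elites=32, iters=6, seed=0):
    """Return the first action of a locally optimized action sequence."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iters):
        # Sample candidate sequences, clipped to the action range [-1, 1].
        candidates = [
            [max(-1.0, min(1.0, rng.gauss(m, s))) for m, s in zip(mean, std)]
            for _ in range(samples)
        ]
        # Keep the sequences with the highest predicted return.
        candidates.sort(key=lambda acts: rollout_return(z0, acts), reverse=True)
        top = candidates[:elites]
        # Refit the sampling distribution to the elite sequences.
        for t in range(horizon):
            column = [acts[t] for acts in top]
            mean[t] = sum(column) / elites
            var = sum((x - mean[t]) ** 2 for x in column) / elites
            std[t] = max(var ** 0.5, 0.05)  # floor keeps some exploration
    return mean[0]

first_action = plan(0.0)  # latent state below the target, so push upward
```

In a receding-horizon loop, `plan` would be called again at every environment step from the newly encoded latent state, so only the first action of each optimized sequence is ever executed.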
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Locomotion | Dog & Humanoid suite | IQM | 0.527 | 32 |
| Humanoid Locomotion and Manipulation | HumanoidBench | IQM | 0.734 | 28 |
| Dexterous Manipulation | MyoSuite | IQM | 0.775 | 28 |
| Continuous Control | HumanoidBench No Hand | Total Reward | 580 | 8 |
| Continuous Control | DeepMind Control (DMC) Suite 200k steps | IQM | 37.4 | 8 |
| Continuous Control | DeepMind Control (DMC) Suite 500k steps | IQM | 56.6 | 8 |
| Continuous Control | DeepMind Control (DMC) Suite (100k steps) | IQM | 0.152 | 8 |
| Continuous Control | DeepMind Control (DMC) Suite (1M steps) | IQM | 69.6 | 8 |
| Continuous Control | DeepMind Control Suite (DMC) | Total Reward | 0.78 | 8 |
| Continuous Control | HumanoidBench Hand | Total Reward | 220 | 8 |
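Several rows above report IQM, the interquartile mean: the average of the middle 50% of scores after discarding the lowest and highest quarters. The sketch below shows one simple way to compute it, assuming a flat list of per-run scores and truncating the number of trimmed values on each side to an integer. The function name `iqm` and the example scores are illustrative and not taken from the benchmarks; the exact aggregation used by each leaderboard may differ.

```python
def iqm(scores):
    """Interquartile mean: average of the middle 50% of the sorted scores."""
    if not scores:
        raise ValueError("scores must be non-empty")
    ordered = sorted(scores)
    n = len(ordered)
    cut = int(n * 0.25)            # number of values trimmed from each end
    middle = ordered[cut:n - cut]  # keep the central half
    return sum(middle) / len(middle)

# Hypothetical per-run normalized scores with one outlier run.
runs = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 10.0]
print(iqm(runs))               # outlier is trimmed away
print(sum(runs) / len(runs))   # plain mean, dominated by the outlier
```

Compared with the plain mean, the IQM is far less sensitive to a single very good or very bad run, which is why it is often preferred when aggregating results over few seeds.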