PD-MORL: Preference-Driven Multi-Objective Reinforcement Learning Algorithm
About
Multi-objective reinforcement learning (MORL) approaches have emerged to tackle many real-world problems with multiple conflicting objectives by maximizing a joint objective function weighted by a preference vector. These approaches find fixed customized policies corresponding to preference vectors specified during training. However, the design constraints and objectives typically change dynamically in real-life scenarios. Furthermore, storing a policy for each potential preference is not scalable. Hence, obtaining a set of Pareto front solutions for the entire preference space in a given domain with a single training run is critical. To this end, we propose a novel MORL algorithm that trains a single universal network to cover the entire preference space and that scales to continuous robotic tasks. The proposed approach, Preference-Driven MORL (PD-MORL), utilizes the preferences as guidance to update the network parameters. It also employs a novel parallelization approach to increase sample efficiency. We show that PD-MORL achieves up to 25% larger hypervolume for challenging continuous control tasks and uses an order of magnitude fewer trainable parameters compared to prior approaches.
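To make two terms from the abstract concrete, the following is a minimal Python sketch of (a) a joint objective weighted by a preference vector, shown here as a plain weighted sum, and (b) the hypervolume of a two-objective solution set. It is an illustration only, not the PD-MORL algorithm or its network update. The function names `scalarize`, `pareto_front`, and `hypervolume_2d`, the reference point, and the sample returns are all assumptions made for this example, and both objectives are assumed to be maximized.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def scalarize(returns: Sequence[float], preference: Sequence[float]) -> float:
    """Weighted sum of per-objective returns (illustrative joint objective)."""
    if len(returns) != len(preference):
        raise ValueError("returns and preference must have the same length")
    return sum(w * r for w, r in zip(preference, returns))


def pareto_front(points: Sequence[Point]) -> List[Point]:
    """Keep the points that no other point dominates (maximization)."""
    front: List[Point] = []
    for p in points:
        dominated = any(
            q != p and all(q_i >= p_i for q_i, p_i in zip(q, p))
            for q in points
        )
        if not dominated and p not in front:
            front.append(p)
    return front


def hypervolume_2d(points: Sequence[Point], reference: Point) -> float:
    """Area dominated by the front and bounded below by the reference point."""
    # Sweep from the largest first objective to the smallest; on a Pareto
    # front the second objective then increases, so each point adds a strip.
    front = sorted(pareto_front(points), key=lambda p: p[0], reverse=True)
    area = 0.0
    prev_y = reference[1]
    for x, y in front:
        if x <= reference[0] or y <= prev_y:
            continue  # contributes no area beyond what is already counted
        area += (x - reference[0]) * (y - prev_y)
        prev_y = y
    return area


if __name__ == "__main__":
    # Hypothetical two-objective returns of four policies.
    candidates = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (1.0, 1.0)]
    print(pareto_front(candidates))
    print(hypervolume_2d(candidates, reference=(0.0, 0.0)))
    # The same return vector scores differently under different preferences.
    print(scalarize((10.0, 2.0), (0.5, 0.5)))
    print(scalarize((10.0, 2.0), (0.9, 0.1)))
```

A larger hypervolume means the solution set dominates more of the objective space relative to the chosen reference point, which is why the metric depends on that choice. The sweep above only handles two objectives; tasks with more objectives need a general hypervolume routine.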
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Continuous Control | MuJoCo Hopper3d | UT Score | 1.29 | 11 |
| Continuous Control | MuJoCo Ant3d | UT | 1.29 | 11 |
| Continuous Control | MuJoCo Halfcheetah2d | UT Score | 3.17 | 11 |
| Continuous Control | MuJoCo Humanoid5d | Undiscounted Return (UT) | 0.38 | 11 |
| Continuous Control | MuJoCo Walker2d | Uncertainty Time (UT) | 1.7 | 11 |
| Continuous Control | MuJoCo Humanoid2d | UT Score | -0.05 | 11 |
| Multi-objective Reinforcement Learning | Deep Sea Treasure | Hypervolume (HV) | 9.33 | 10 |
| Multi-objective Reinforcement Learning | MuJoCo 8 continuous-action tasks MO-Gymnasium (aggregated) | Hypervolume (HV) | 3.25 | 7 |
| Multi-objective Reinforcement Learning | Fruit Tree Navigation | UT | 5.03 | 7 |