Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning
About
Offline reinforcement learning (RL) aims to find a near-optimal policy using pre-collected datasets. In real-world scenarios, data collection can be costly and risky, so in-domain data is often limited, which makes offline RL particularly challenging. Given recent advances in Large Language Models (LLMs) and their few-shot learning prowess, this paper introduces $\textbf{La}$nguage Models for $\textbf{Mo}$tion Control ($\textbf{LaMo}$), a general framework based on Decision Transformers that effectively uses pre-trained Language Models (LMs) for offline RL. Our framework highlights four crucial components: (1) initializing Decision Transformers with sequentially pre-trained LMs, (2) employing the LoRA fine-tuning method, in contrast to full-weight fine-tuning, to combine the pre-trained knowledge from LMs and in-domain knowledge effectively, (3) using a non-linear MLP transformation, instead of linear projections, to generate embeddings, and (4) integrating an auxiliary language prediction loss during fine-tuning to stabilize the LMs and retain their original language abilities. Empirical results indicate that $\textbf{LaMo}$ achieves state-of-the-art performance in sparse-reward tasks and closes the gap between value-based offline RL methods and Decision Transformers in dense-reward tasks. In particular, our method demonstrates superior performance in scenarios with limited data samples.
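To make component (2) concrete, the following is a minimal pure-Python sketch of a LoRA-style linear layer: the pre-trained weight stays frozen and only a low-rank update is trainable. It illustrates the general LoRA idea, not the paper's implementation; the names `LoRALinear`, `matvec`, `rank`, and `alpha`, as well as the toy weights, are assumptions made for this example.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Toy linear layer y = W x + (alpha / rank) * B (A x), with W frozen.

    Hypothetical illustration only: W stands in for a pre-trained LM weight,
    while A and B are the small matrices that fine-tuning would update.
    """

    def __init__(self, frozen_weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(frozen_weight), len(frozen_weight[0])
        self.frozen_weight = frozen_weight  # pre-trained, never updated
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the layer
        # initially reproduces the pre-trained output exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.frozen_weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameters(self):
        """Count only the low-rank parameters (A and B)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy 4x4 "pre-trained" weight (made-up values).
frozen = [
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 3.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
layer = LoRALinear(frozen, rank=1)
x = [1.0, 1.0, 1.0, 1.0]
initial_output = layer.forward(x)
```

With `rank=1`, the layer exposes 8 trainable parameters instead of the 16 in the full weight matrix, and its initial output matches the frozen layer because `B` starts at zero. In practice one would likely rely on an existing LoRA library rather than hand-written code like this.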
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | D4RL hopper-expert v2 | Normalized Score | 111.6 | 56 |
| Offline Reinforcement Learning | D4RL halfcheetah-expert v2 | Normalized Score | 92 | 56 |
| Offline Reinforcement Learning | D4RL walker2d-expert v2 | Normalized Score | 108.1 | 56 |
| Offline Reinforcement Learning | D4RL antmaze-umaze (diverse) | Normalized Score | 70 | 40 |
| Offline Reinforcement Learning | D4RL AntMaze-Umaze v0 | Average Normalized Score | 80 | 5 |
| Offline Reinforcement Learning | D4RL Ant Medium-Replay v2 | Normalized Score | 92.7 | 4 |
| Offline Reinforcement Learning | D4RL Ant Medium-Expert v2 | Normalized Score | 134.8 | 4 |
| Offline Reinforcement Learning | D4RL Ant-Expert v2 | Normalized Score | 134.2 | 4 |
| Offline Reinforcement Learning | D4RL Ant-Medium v2 | Normalized Score | 94.6 | 4 |
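
For readers unfamiliar with the metric, D4RL normalized scores conventionally rescale a raw episode return so that a random policy maps to 0 and an expert policy maps to 100. The sketch below assumes the table follows that convention; the function name and the reference returns are hypothetical placeholders, not values taken from the paper or from D4RL.

```python
def d4rl_normalized_score(raw_return, random_return, expert_return):
    """Rescale a raw return so that random -> 0 and expert -> 100."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# Hypothetical reference returns, chosen only to keep the arithmetic readable.
RANDOM_RETURN = 0.0
EXPERT_RETURN = 200.0

below_expert = d4rl_normalized_score(150.0, RANDOM_RETURN, EXPERT_RETURN)
above_expert = d4rl_normalized_score(220.0, RANDOM_RETURN, EXPERT_RETURN)
```

Under this convention a score above 100, such as several entries in the table, means the evaluated policy collected a higher return than the expert reference for that environment.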