Divide and Conquer: Grounding LLMs as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning

About

While showing sophisticated reasoning abilities, large language models (LLMs) still struggle with long-horizon decision-making tasks due to deficient exploration and long-term credit assignment, especially in sparse-reward scenarios. Inspired by the divide-and-conquer principle, we propose an innovative framework **GLIDER** (**G**rounding **L**anguage Models as Eff**I**cient **D**ecision-Making Agents via Offline Hi**E**rarchical **R**einforcement Learning) that introduces a parameter-efficient and generally applicable hierarchy to LLM policies. We develop a scheme where the low-level controller is supervised with abstract, step-by-step plans that are learned and instructed by the high-level policy. This design decomposes complicated problems into a series of coherent chain-of-thought reasoning sub-tasks, providing flexible temporal abstraction to significantly enhance exploration and learning for long-horizon tasks. Furthermore, GLIDER facilitates fast online adaptation to non-stationary environments owing to the strong transferability of its task-agnostic low-level skills. Experiments on ScienceWorld and ALFWorld benchmarks show that GLIDER achieves consistent performance gains, along with enhanced generalization capabilities.
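The two-level control loop described above can be illustrated with a minimal toy sketch. This is not the authors' implementation: the environment, the rule-based policies, and every name here (`ToyEnv`, `rollout`, `steps_per_subtask`, and so on) are hypothetical stand-ins for the LLM-based high-level and low-level policies. It only shows the shape of the interaction: the high-level policy proposes a sub-task, the low-level policy emits primitive actions conditioned on that sub-task, and the reward arrives only when the whole task is finished.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# All names below are hypothetical and not taken from the GLIDER code base.
HighLevelPolicy = Callable[[str, str], str]  # (task, observation) -> sub-task
LowLevelPolicy = Callable[[str, str], str]   # (sub-task, observation) -> action


@dataclass
class ToyEnv:
    """Toy text environment: the task is solved by executing `plan` in order."""
    plan: List[str]
    done_steps: int = 0

    def observe(self) -> str:
        return f"{self.done_steps} of {len(self.plan)} steps completed"

    def step(self, action: str) -> Tuple[str, float, bool]:
        if self.done_steps < len(self.plan) and action == self.plan[self.done_steps]:
            self.done_steps += 1
        finished = self.done_steps == len(self.plan)
        # Sparse reward: 1.0 only once the whole task is finished.
        return self.observe(), float(finished), finished


def rollout(env: ToyEnv, task: str, high: HighLevelPolicy, low: LowLevelPolicy,
            steps_per_subtask: int = 2, max_subtasks: int = 10):
    """Alternate between high-level sub-task selection and low-level actions."""
    trajectory = []
    obs = env.observe()
    for _ in range(max_subtasks):
        subtask = high(task, obs)            # temporal abstraction: one plan step
        for _ in range(steps_per_subtask):
            action = low(subtask, obs)       # primitive action for this sub-task
            obs, reward, finished = env.step(action)
            trajectory.append((subtask, action, reward))
            if finished:
                return trajectory
    return trajectory


def high_policy(task: str, obs: str) -> str:
    # Rule-based stand-in for a learned high-level planner.
    return "boil water" if int(obs.split()[0]) < 2 else "serve water"


def low_policy(subtask: str, obs: str) -> str:
    # Rule-based stand-in for a learned low-level controller.
    skills = {
        ("boil water", 0): "pick up kettle",
        ("boil water", 1): "turn on stove",
        ("serve water", 2): "pour water",
    }
    return skills.get((subtask, int(obs.split()[0])), "wait")


env = ToyEnv(plan=["pick up kettle", "turn on stove", "pour water"])
trajectory = rollout(env, "make tea water", high_policy, low_policy)
for subtask, action, reward in trajectory:
    print(f"{subtask:12s} | {action:15s} | reward={reward}")
```

In this toy run the low-level table is keyed by sub-task rather than by the overall task, which is one simple way to picture why task-agnostic low-level skills could be reused when the high-level goal changes.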

Zican Hu, Wei Liu, Xiaoye Qu, Xiangyu Yue, Chunlin Chen, Zhi Wang, Yu Cheng • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Interactive Decision-making	ALFWorld	Overall Success Rate	75.4	398
Interactive Decision-making	ScienceWorld	Success Rate	60.48	78
Interactive Decision-making	ScienceWorld Seen	Success Rate	77.43	72
Interactive Decision-making	ALFWorld Unseen	Success Rate	75.38	67
Interactive Decision-making	TextCraft	Success Rate	28.5	60
Interactive Decision-making	ALFWorld Seen	Success Rate	72.12	47
Interactive Decision-making	ScienceWorld Unseen	Success Rate	68.34	32
Interactive Decision-making	ALFWorld OOD	Success Rate	45.71	18
Decision-making in interactive environments	ScienceWorld Llama-3.1 8B backbone (Out-of-Distribution (OOD))	Performance	34.36	6
Decision-making in interactive environments	ScienceWorld Llama-3.1 8B backbone (In-Distribution (ID))	Performance	60.48	6
