Strat-Reasoner: Reinforcing Strategic Reasoning of LLMs in Multi-Agent Games

About

While Large Language Models (LLMs) excel in certain reasoning tasks, they struggle in multi-agent games, where the final outcome depends on the joint strategies of all agents. In such games, the non-stationarity of other agents poses significant challenges for evaluating the reasoning process and for assigning credit across multiple reasoning steps. Existing single-agent reinforcement learning (RL) approaches and their multi-agent extensions fail to address these challenges because they do not incorporate other agents into the reasoning process. In this work, we propose Strat-Reasoner, a novel RL-based framework that improves LLMs' strategic reasoning ability in multi-agent games. We introduce a recursive reasoning paradigm in which an agent's reasoning also integrates other agents' reasoning processes. To provide effective reward signals for the intermediate reasoning sequences, we employ a centralized Chain-of-Thought (CoT) comparison module to evaluate reasoning quality. Finally, we compute an accurate hybrid advantage and develop a group-relative RL approach to optimize the LLM policy. Experimental results show that Strat-Reasoner substantially improves the strategic abilities of the underlying LLMs, achieving an average performance improvement of 22.1% across various multi-agent games. Code is publicly available at https://github.com/ydhe1012/Strat-Reasoner.

Yidong He, Yutao Lai, Pengxu Yang, Jiarui Gan, Jiexin Wang, Yi Cai, Mengchen Zhao • 2026
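
The abstract mentions a hybrid advantage optimized with a group-relative RL approach but does not define either one. The sketch below is only an illustration under stated assumptions: it mixes a game-outcome reward with a CoT-quality score using a hypothetical weight `mix`, then standardizes the mixed rewards within a group of sampled responses, in the style of GRPO. The function name, the linear mixing rule, and the normalization are assumptions for illustration, not the authors' formulation; the linked repository holds the actual method.

```python
from statistics import mean, pstdev


def group_relative_advantages(outcome_rewards, reasoning_scores, mix=0.5, eps=1e-8):
    """Illustrative (assumed) hybrid, group-relative advantage.

    outcome_rewards:  one game-outcome reward per sampled response.
    reasoning_scores: one reasoning-quality score per sampled response
                      (standing in for a CoT comparison signal).
    mix:              hypothetical weight on the reasoning score, in [0, 1].
    """
    if len(outcome_rewards) != len(reasoning_scores):
        raise ValueError("both lists need one entry per sampled response")
    if not outcome_rewards:
        return []

    # Assumed hybrid reward: a convex combination of the two signals.
    hybrid = [
        (1.0 - mix) * outcome + mix * score
        for outcome, score in zip(outcome_rewards, reasoning_scores)
    ]

    # Group-relative step: standardize each reward against its own group.
    group_mean = mean(hybrid)
    group_std = pstdev(hybrid)
    return [(h - group_mean) / (group_std + eps) for h in hybrid]


if __name__ == "__main__":
    # Four sampled responses to the same game state (made-up numbers).
    advantages = group_relative_advantages(
        outcome_rewards=[1.0, 0.0, 1.0, 0.0],
        reasoning_scores=[0.8, 0.2, 0.4, 0.6],
    )
    print(advantages)
```

Standardizing within the group means a response is credited only for doing better than its siblings sampled from the same state, which avoids training a separate value network. If every response in a group receives the same hybrid reward, all advantages are zero and the group contributes no gradient.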

Related benchmarks

Task	Dataset	Metric	Score	
Multi-Agent Strategic Reasoning	ConnectFour OOD	First-mover Normalized Score	75.93	18
Multi-Agent Game	Tic-Tac-Toe vs. MCTS Bot, 100 sims	First-move Normalized Score	90.77	9
Multi-Agent Game	KuhnPoker vs. NE Bot	Normalized Score (First Move)	94.04	9
Multi-Agent Game	MiniHanabi Co-op	Average Normalized Game Score	80.19	9
Multi-Agent Game	Tic-Tac-Toe vs. MCTS Bot, 1000 sims	First-move Normalized Score	77.6	9
Multi-Agent Strategic Reasoning	LeducHoldem OOD	First-mover Normalized Score	70.12	9
Multi-Agent Strategic Reasoning	SimpleHanabi OOD	Collective Avg Normalized Score	68.63	9

