Implicit Strategic Optimization: Rethinking Long-Horizon Decision-Making in Adversarial Poker Environments
About
Training large language model (LLM) agents for adversarial games is often driven by episodic objectives such as win rate. In long-horizon settings, however, payoffs are shaped by latent strategic externalities that evolve over time, so myopic optimization and variation-based regret analyses can become vacuous even when the dynamics are predictable. To address this, we introduce Implicit Strategic Optimization (ISO), a prediction-aware framework in which each agent forecasts the current strategic context and uses the forecast to update its policy online. ISO combines a Strategic Reward Model (SRM), which estimates the long-run strategic value of actions, with ISO-GRPO, a context-conditioned optimistic learning rule. We prove sublinear contextual regret and equilibrium convergence guarantees whose dominant terms scale with the number of context mispredictions; when prediction errors are bounded, our bounds recover the static-game rates obtained when strategic externalities are known. Experiments in 6-player No-Limit Texas Hold'em and competitive Pokémon play show consistent improvements in long-term return over strong LLM and RL baselines, and graceful degradation under controlled prediction noise.
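The abstract describes a loop of three pieces: forecast the strategic context, score actions with the SRM, and act via a context-conditioned optimistic rule. The sketch below is a minimal, hypothetical rendering of that loop in Python; all names (`predict_context`, `srm_values`, the UCB-style bonus, the toy reward) are illustrative assumptions, not the paper's actual SRM or ISO-GRPO update.

```python
import numpy as np

# Hypothetical sketch of the ISO loop from the abstract (assumed structure):
# forecast context -> optimistic action choice per context -> online SRM update.
rng = np.random.default_rng(0)
n_actions, n_contexts = 4, 3

# Strategic Reward Model stand-in: per-context estimates of long-run value.
srm_values = np.zeros((n_contexts, n_actions))  # running value estimates
counts = np.ones((n_contexts, n_actions))       # visit counts for the bonus

def predict_context(history):
    """Placeholder forecast of the current strategic context (assumed)."""
    return len(history) % n_contexts

history = []
for t in range(1, 1001):
    c = predict_context(history)                 # forecast strategic context
    # Context-conditioned optimistic rule: value estimate + exploration bonus.
    ucb = srm_values[c] + np.sqrt(2.0 * np.log(t) / counts[c])
    a = int(np.argmax(ucb))                      # act greedily w.r.t. optimism
    reward = rng.normal(loc=float(a == c), scale=0.1)  # toy long-run payoff
    # Online SRM update toward the observed long-run strategic return.
    counts[c, a] += 1
    srm_values[c, a] += (reward - srm_values[c, a]) / counts[c, a]
    history.append((c, a, reward))
```

Under this reading, a bounded number of context mispredictions only perturbs finitely many rounds of the optimistic rule, which is consistent with the stated regret bounds scaling in the misprediction count.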
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Heads-Up No-Limit Texas Hold'em | Slumbot | -- | 9 |
| 6-player No-Limit Texas Hold'em | 6-player No-Limit Texas Hold'em, 10,000 hands | LTR 15.8 | 7 |
| Competitive Pokémon Play | Pokémon OU against GPT-4o, Gen 1 (test) | Win Rate 0.7 | 3 |