
GTO Wizard Benchmark

About

We introduce GTO Wizard Benchmark, a public API and standardized evaluation framework for benchmarking algorithms in Heads-Up No-Limit Texas Hold'em (HUNL). The benchmark evaluates agents against GTO Wizard AI, a state-of-the-art superhuman poker agent that approximates Nash equilibria and that defeated Slumbot, the 2018 Annual Computer Poker Competition champion and previously the strongest publicly accessible HUNL benchmark, by $19.4 \pm 4.1$ bb/100. Variance is a fundamental challenge in poker evaluation; we address this by integrating AIVAT, a provably unbiased variance reduction technique that achieves equivalent statistical significance with ten times fewer hands than naive Monte Carlo evaluation. We conduct a comprehensive benchmarking study of state-of-the-art large language models under zero-shot conditions, including GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4, and others. Initial results show dramatic progress in LLM reasoning over recent years, yet all models remain far below the baseline established by our benchmark. Qualitative analysis reveals clear opportunities for improvement, including representation and the ability to reason over hidden states. This benchmark provides researchers with a precise and quantifiable setting to evaluate advances in planning and reasoning in multi-agent systems with partial observability.
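
To make the "ten times fewer hands" claim concrete, here is a minimal sketch of how sample size relates to the width of a win-rate confidence interval. It is not the benchmark's implementation. It assumes a normal approximation and a 95% interval (the abstract does not say which interval the $\pm 4.1$ represents), and the per-hand standard deviation of 10 bb is a hypothetical placeholder, as are the function names. The only fact used is that the interval half-width scales as $\sigma / \sqrt{n}$, so cutting the per-hand variance by a factor of ten cuts the required number of hands by the same factor.

```python
import math


def ci_half_width(std_per_hand_bb: float, n_hands: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval, in bb/100."""
    # Standard error of the per-hand mean, scaled to a per-100-hands rate.
    return z * std_per_hand_bb / math.sqrt(n_hands) * 100


def hands_needed(std_per_hand_bb: float, target_half_width: float, z: float = 1.96) -> int:
    """Smallest number of hands whose interval half-width (bb/100) meets the target."""
    return math.ceil((z * std_per_hand_bb * 100 / target_half_width) ** 2)


# Hypothetical per-hand standard deviation for naive Monte Carlo evaluation.
naive_std = 10.0
# A 10x variance reduction divides the standard deviation by sqrt(10).
reduced_std = naive_std / math.sqrt(10)

target = 4.1  # half-width in bb/100, borrowed from the abstract for illustration
naive_hands = hands_needed(naive_std, target)
reduced_hands = hands_needed(reduced_std, target)

print(f"naive: {naive_hands} hands, variance-reduced: {reduced_hands} hands")
print(f"ratio: {naive_hands / reduced_hands:.2f}")
```

Under these assumptions the ratio of required hands comes out at roughly ten, independent of the placeholder standard deviation chosen.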

Marc-Antoine Provost, Nejc Ilenic, Christopher Solinas, Philippe Beardsell • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Heads-Up No-Limit Texas Hold'em | GTOWizard HUNL | BB/h: 223 | 11 |
| Heads-Up No-Limit Hold'em (HUNL) | GTOWizard benchmark | mbb/h: 132 | 4 |
