Game of Thought: Robust Information Seeking with Large Language Models Using Game Theory
About
Large Language Models (LLMs) are increasingly deployed in real-world scenarios where they may lack sufficient information to complete a given task. In such settings, the ability to actively seek out missing information becomes a critical capability. Existing approaches to enhancing this ability often rely on simplifying assumptions that degrade *worst-case* performance, an issue with serious implications for high-stakes applications. In this work, we use the game of Twenty Questions to evaluate the information-seeking ability of LLMs. We introduce and formalize its adversarial counterpart, the Strategic Language Search (SLS) problem, along with its variants, as a two-player zero-sum extensive-form game. We propose Game of Thought (GoT), a framework that applies game-theoretic techniques to approximate a Nash equilibrium (NE) strategy for the restricted variant of the game. Empirical results demonstrate that our approach consistently improves worst-case performance compared to (1) direct prompting-based methods and (2) heuristic-guided search methods across all tested settings.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 20 Questions | 20Q Common | Worst-Case Interaction Length | 10 | 8 |
| 20 Questions | 20Q S128 | Worst-Case Interaction Length | 10.8 | 8 |
| 20 Questions | 20Q Breeds | Worst-Case Interaction Length | 6.6 | 8 |
| Medical Diagnosis | MD DX | Worst-Case Interaction Length | 10.5 | 8 |
| Troubleshooting | TS FloDial | Worst-Case Interaction Length | 7.5 | 8 |
| Information Seeking | 20Q Breeds weighted (test) | Worst-Case Weighted Payoff | 32.3 | 8 |
| Information Seeking | 20Q Common weighted (test) | Worst-Case Weighted Payoff | 152.1 | 8 |
| Medical Diagnosis | MD DX weighted (test) | Worst-Case Weighted Payoff | 78.3 | 8 |
| Troubleshooting | TS FloDial weighted (test) | Worst-Case Weighted Payoff | 62.3 | 8 |
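The worst-case interaction length reported above can be made concrete with a minimal minimax sketch of a Twenty Questions-style game: the questioner picks the yes/no question that minimizes the number of questions an adversarial answerer can force before the target is uniquely identified. The toy encoding below (candidates as attribute bit-tuples, questions as attribute indices) and the function name are illustrative assumptions for this page, not the paper's GoT implementation, which approximates a Nash equilibrium of the full extensive-form game.

```python
from functools import lru_cache

def worst_case_questions(candidates, questions):
    """Fewest questions that guarantee identifying any candidate,
    against an adversarial answerer (minimax over the game tree).

    candidates: iterable of attribute tuples, e.g. (0, 1, 1)
    questions: attribute indices; question q asks "is attribute q true?"
    """
    cands = frozenset(candidates)

    @lru_cache(maxsize=None)
    def solve(cset):
        # One (or zero) candidates left: the target is identified.
        if len(cset) <= 1:
            return 0
        best = float("inf")
        for q in questions:
            yes = frozenset(c for c in cset if c[q])
            no = cset - yes
            if not yes or not no:
                continue  # uninformative: every candidate answers the same
            # Adversary picks the worse branch (max); questioner picks
            # the question minimizing that worst case (min).
            best = min(best, 1 + max(solve(yes), solve(no)))
        return best

    return solve(cands)
```

For instance, four candidates spanning all two-attribute combinations need two questions in the worst case, matching the intuition that each perfectly balanced question halves the candidate set.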