DyBBT: Dynamic Balance via Bandit-inspired Targeting for Dialog Policy with Cognitive Dual-Systems

About

Task-oriented dialog systems often rely on static exploration strategies that do not adapt to dynamic dialog contexts, leading to inefficient exploration and suboptimal performance. We propose DyBBT, a novel dialog policy learning framework that formalizes the exploration challenge through a structured cognitive state space capturing dialog progression, user uncertainty, and slot dependency. DyBBT uses a bandit-inspired meta-controller that dynamically switches between fast intuitive inference (System 1) and a slow deliberative reasoner (System 2) based on real-time cognitive states and visitation counts. Extensive experiments on single- and multi-domain benchmarks show that DyBBT achieves state-of-the-art performance in success rate, efficiency, and generalization, with human evaluations confirming that its decisions are well aligned with expert judgment.

Shuyu Zhang, Yifan Wei, Jialuo Yuan, Xinru Wang, Yanmin Zhu, Bin Li, Yujie Liu • 2025
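
The abstract describes a meta-controller that switches between System 1 and System 2 using cognitive states and visitation counts, but it does not give the switching rule. The following is a minimal sketch of one way such a switch could look, assuming a generic UCB-style count bonus over a discretized state. The names `discretize` and `MetaController`, the bin count, the bonus formula, and the threshold are all hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def discretize(progress, uncertainty, dependency, bins=5):
    """Map three features in [0, 1] to a coarse, hashable state (assumed design)."""
    return tuple(
        min(int(x * bins), bins - 1)
        for x in (progress, uncertainty, dependency)
    )


class MetaController:
    """Hypothetical count-based switch between a fast and a slow policy."""

    def __init__(self, bonus_scale=1.0, threshold=0.5):
        self.bonus_scale = bonus_scale  # assumed hyperparameter
        self.threshold = threshold      # assumed hyperparameter
        self.visits = defaultdict(int)  # per-state visitation counts
        self.total = 0                  # total number of decisions so far

    def exploration_bonus(self, state):
        # UCB-style bonus: large for rarely visited states, and it shrinks
        # as the same state is visited again and again.
        n = self.visits[state]
        return self.bonus_scale * math.sqrt(math.log(self.total + 2) / (n + 1))

    def choose(self, state):
        # Compute the bonus before updating the counts for this visit.
        bonus = self.exploration_bonus(state)
        self.visits[state] += 1
        self.total += 1
        # Unfamiliar state: defer to the slow reasoner; otherwise act fast.
        return "system2" if bonus > self.threshold else "system1"


controller = MetaController()
state = discretize(progress=0.1, uncertainty=0.9, dependency=0.5)
first_choice = controller.choose(state)
```

In this sketch a state that has rarely been seen routes to the slow reasoner, while a frequently visited state falls back to the fast policy. How the actual method defines its states and combines them with counts should be taken from the paper itself.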

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Goal-oriented dialogue | Movie | Success Rate: 89.52 | 44 |
| Goal-oriented dialogue | Restaurant | Success Rate: 83.38 | 34 |
| Goal-oriented dialogue | Taxi | Success Rate: 85.53 | 32 |
| Multi-domain Dialog | MultiWOZ | Success Rate: 85.3 | 13 |
| Multi-domain Dialog | MultiWOZ Real World User Experiment | Success Rate: 84.7 | 4 |