20Q

Benchmarks

Task Name	Dataset Name	SOTA Result
Confidence Estimation	20Q	Accuracy: 33.87	20
Multi-step interaction	20Q	Winrate: 32.1	15
20 Questions	20Q Breeds	Worst Case Interaction Length: 6.6	8
20 Questions	20Q S128	Worst Case Interaction Length: 10.8	8
20 Questions	20Q Common	Worst Case Interaction Length: 10	8
Information Seeking	20Q Common weighted (test)	Worst-case Weighted Payoff: 235.7	8
Event Plausibility Prediction	20Q (test)	AUC: 0.74	6
