| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Confidence Estimation | 20Q | Accuracy | 33.87 | 20 |
| Multi-step interaction | 20Q | Winrate | 32.1 | 15 |
| 20 Questions | 20Q Breeds | Worst Case Interaction Length | 6.6 | 8 |
| 20 Questions | 20Q S128 | Worst Case Interaction Length | 10.8 | 8 |
| 20 Questions | 20Q Common | Worst Case Interaction Length | 10 | 8 |
| Information Seeking | 20Q Common weighted (test) | Worst-case Weighted Payoff | 235.7 | 8 |
| Event Plausibility Prediction | 20Q (test) | AUC | 0.74 | 6 |