Towards a Human-like Open-Domain Chatbot
About
We present Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations. This 2.6B parameter neural network is simply trained to minimize perplexity of the next token. We also propose a human evaluation metric called Sensibleness and Specificity Average (SSA), which captures key elements of a human-like multi-turn conversation. Our experiments show strong correlation between perplexity and SSA. The fact that the best perplexity end-to-end trained Meena scores high on SSA (72% on multi-turn evaluation) suggests that a human-level SSA of 86% is potentially within reach if we can better optimize perplexity. Additionally, the full version of Meena (with a filtering mechanism and tuned decoding) scores 79% SSA, 23% higher in absolute SSA than the existing chatbots we evaluated.
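The two quantities the abstract relates, perplexity and SSA, can be illustrated with a minimal sketch. It is not the authors' evaluation code. The function names, the input format (per-token natural-log probabilities, and one `(sensible, specific)` label pair per response), and the rule that a response counts as specific only if it is also sensible are assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity as exp of the mean negative log-likelihood per token.

    Assumes natural-log probabilities, one per predicted token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def ssa(labels: Sequence[Tuple[bool, bool]]) -> float:
    """Sensibleness and Specificity Average over labeled responses.

    Each label is a hypothetical (sensible, specific) pair for one response.
    Assumption: a response counts as specific only if it is also sensible.
    """
    if not labels:
        raise ValueError("need at least one labeled response")
    n = len(labels)
    sensibleness = sum(1 for sensible, _ in labels if sensible) / n
    specificity = sum(1 for sensible, specific in labels if sensible and specific) / n
    # SSA is the simple average of the two rates.
    return (sensibleness + specificity) / 2
```

For example, a model that assigns probability 0.25 to every token has perplexity 4, and four responses of which three are sensible and two are also specific give an SSA of (0.75 + 0.5) / 2 = 0.625.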
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-based Visual Question Answering | TextVQA | -- | -- | 496 |
| Visual Question Answering | GQA | Accuracy | 0.00e+0 | 374 |
| Chart Question Answering | ChartQA | -- | -- | 229 |
| Diagram Question Answering | AI2D | AI2D Accuracy | 69.1 | 196 |
| Optical Character Recognition Benchmarking | OCRBench | -- | -- | 109 |
| Visual Question Answering | COCO | Score | 6.2 | 21 |
| OCR-based Visual Question Answering | OCRVQA | Mean Accuracy | 0.00e+0 | 13 |
| Multi-modal Evaluation | MME-RW | Mean Accuracy | 27.6 | 10 |
| Multi-modal Hallucination Evaluation | AMBER | Mean Accuracy | 53.5 | 10 |
| Dialogue Evaluation | Human/Model Chats (test) | Engagement Score | 37 | 6 |