| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Video Dialogue | AVSD DSTC8 (test) | BLEU-4 47.5 | 24 |
| Audio-Visual Scene-Aware Dialog | AVSD (test) | CIDEr 1.605 | 11 |
| Audio-Video Understanding | AVSD (test) | Accuracy 62.8 | 9 |
| Audio-Visual Scene-aware Dialog | AVSD (val) | ASR (%) 59.48 | 7 |
| Open-Ended Audio-Video QA | AVSD | Accuracy 57.2 | 7 |
| Audio-Visual Question Answering | AVSD 1 (test) | CIDEr 152.9 | 6 |
| Audio-Visual Question Answering | AVSD | Accuracy 54.8 | 6 |
| Video Dialogue | AVSD DSTC7 (test) | BLEU-1 78.9 | 6 |
| Video Dialogue | AVSD DSTC10 (test) | CIDEr 103.3 | 6 |
| Video Dialog | AVSD DSTC7 | BLEU-1 55.5 | 6 |
| Video Dialog | AVSD DSTC10 | BLEU-1 0.546 | 6 |
| Audio-Visual Question Answering | AVSD (test) | CIDEr 108.5 | 6 |
| Response Generation | AVSD | CIDEr 85.1 | 4 |