| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Text-to-Audio Retrieval | VALOR | Recall@1 | 36.4 | 24 |
| Text-to-Video Retrieval | VALOR-32K | Recall@1 | 80 | 18 |
| Zero-shot Retrieval (T+V → A) | VALOR | Recall@1 | 78.8 | 14 |
| Zero-shot Retrieval (T+A → V) | VALOR | Recall@1 | 93 | 14 |
| Zero-shot Retrieval (T → A+V) | VALOR | Recall@1 | 76.9 | 14 |
| Audio-Visual Question Answering | VALOR (test) | M.J. Score | 44.67 | 12 |
| Audio-to-Text Retrieval | VALOR | Recall@1 | 35.1 | 9 |
| Captioning | VALOR 32K | CIDEr | 62.8 | 9 |
| Text-to-audiovisual Retrieval | VALOR-32K (test) | Recall@1 | 80.9 | 7 |
| Audio-Visual Captioning | VALOR 32K (val) | BLEU@4 | 16.88 | 7 |
| Retrieval | VALOR T+V -> A | Recall@1 | 78.8 | 6 |
| Retrieval | VALOR T+A -> V | Recall@1 | 93 | 6 |
| Retrieval | VALOR T -> A+V | Recall@1 | 76.8 | 6 |
| Audiovisual Captioning | VALOR-32K | B@4 | 9.6 | 5 |
| Audio-Visual Question Answering | VALOR (test) | CIDEr | 62.2 | 5 |
| Audio-visual captioning | VALOR-32K (test) | CIDEr | 62.2 | 4 |
| Audio-Visual Question Answering | VALOR | M.J. Score | 56.53 | 3 |
| Text-to-Video-Audio Retrieval | VALOR-32K | Recall@1 | 78.7 | 2 |